mosullivan93
Contributor

@mosullivan93 mosullivan93 commented Mar 10, 2023

This PR contributes partial sorting algorithms (i.e. sort only as much as is required) for both a single index and only the first k indices. Closes #10.

  1. The function to place a single element (the kth smallest) in its sorted position (and partition the array around it) is an implementation of QuickSelect and is named avx512_qselect. This is analogous to std::nth_element (where nth would be arr.begin() + k).
  2. The function to partially sort the array (and place the first k elements in their sorted position at the front of the array) mirrors std::partial_sort (where middle would be arr.begin() + k). This function is avx512_partial_qsort.
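
For reference, the standard-library behaviour that the two new functions are described as mirroring can be sketched as follows. This uses only std::nth_element and std::partial_sort as stand-ins; the helper names select_kth and sort_first_k are hypothetical and are not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics only: std::nth_element stands in for avx512_qselect.
// After the call, v[k] holds the value it would have in a fully sorted array,
// everything before it is <= v[k], and everything after it is >= v[k].
inline std::vector<int> select_kth(std::vector<int> v, std::size_t k)
{
    assert(k < v.size());
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v;
}

// Reference semantics only: std::partial_sort stands in for avx512_partial_qsort.
// After the call, the first k elements are the k smallest, in ascending order;
// the order of the remaining elements is unspecified.
inline std::vector<int> sort_first_k(std::vector<int> v, std::size_t k)
{
    assert(k <= v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end());
    return v;
}
```

Both helpers take the input by value, so the caller's data is left untouched and the two results can be compared against the same input.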

Additional changes:

  1. Fixed a small issue with the Makefile (cleaning up benchdir) and the AVX512FP16 check for Meson.
  2. Changed some comments around the partition functions in the avx512-common-qsort.h to reflect my interpretation of their functioning. I have not updated the relevant comments in the kv files.

Edit: Corrected description of relationship between avx512_qselect and std::nth_element. This had changed during development but I forgot to update here.

@mosullivan93
Contributor Author

mosullivan93 commented Mar 11, 2023

Tests have been added for the partial sort functions.

I am not sure why, but having -O3 causes some tests to fail (not just the tests related to partial sorting). Diving into that is beyond my abilities.

Edit 2023-04-10:
The optimisation level no longer affects the tests the way it did previously. I never found out why there was a problem in the first place, but the code has changed substantially since.

@r-devulap
Member

Thank you for your contribution! Apologies for the merge conflicts. I need some time to review the code and the new tests, but I like the idea of supporting partial sort. Once we fix the tests and ensure they pass, we will also need benchmarks to make sure this provides the perf benefits. Please refrain from adding any benchmarks just yet; I am considering using Google Benchmark rather than writing the whole thing myself.

@r-devulap
Member

Is there a high level benchmark or a downstream project that can benefit from this patch? Also pinging @WilliamTambellini who originally opened #13

@WilliamTambellini
Contributor

Hi @r-devulap
Cool.
A first baseline bench would of course be to compare with std::partial_sort.
Typical usage in deep learning is an input vector of several tens of thousands of float32 values and a quite small value of k (2, 4, 8, ...).
Then, if that is good enough, it may be worth comparing with aten topk:
https://pytorch.org/cppdocs/api/function_namespaceat_1a5fdb33147326ece7b1b11e0073477315.html?highlight=topk
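
A minimal sketch of the std::partial_sort baseline for that use case, assuming "top-k" means the k largest values returned in descending order. The function name topk_baseline is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Baseline top-k: returns the k largest values in descending order.
// Takes the input by value so the caller's vector is left untouched.
inline std::vector<float> topk_baseline(std::vector<float> logits, std::size_t k)
{
    assert(k <= logits.size());
    // std::greater reverses the ordering, so the "smallest" k under this
    // comparator are the k largest values.
    std::partial_sort(logits.begin(), logits.begin() + k, logits.end(),
                      std::greater<float>());
    logits.resize(k);
    return logits;
}
```

With tens of thousands of inputs and k around 2 to 8, almost all of the work is the scan over the input rather than the ordering of the k results.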

@mosullivan93
Contributor Author

Based off of #10, I was planning to use std::partial_sort for the range sort method benchmarks. I believe it could also be used for the single index approach, but I'm not sure if there's another suitable function for that comparison.

@mosullivan93
Contributor Author

I've treated the parameter k as if it were a 1-based index. If you think it's more appropriate in C++ for it to be an offset into the array, I could tweak the code. That would potentially be more consistent, as it would align with std::partial_sort and with how left and right work for the original functions.
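
To make the two conventions concrete, here is a small sketch using std::nth_element as a stand-in. Both function names are hypothetical and only illustrate how the same k is interpreted under each convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Convention A: k is a 1-based rank, so k = 1 asks for the smallest element.
inline int kth_smallest_by_rank(std::vector<int> v, std::size_t k)
{
    assert(k >= 1 && k <= v.size());
    std::nth_element(v.begin(), v.begin() + (k - 1), v.end());
    return v[k - 1];
}

// Convention B: k is a 0-based offset, matching
// std::nth_element(first, first + k, last).
inline int value_at_offset(std::vector<int> v, std::size_t k)
{
    assert(k < v.size());
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];
}
```

Under convention A the valid range is 1..n, under convention B it is 0..n-1, and rank k corresponds to offset k - 1.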

@r-devulap
Member

Hi could you please rebase with main? I will spend some time on this next week.

@mosullivan93
Contributor Author

I'll have it done by early next week. Currently away from home.

@r-devulap
Member

> I'll have it done by early next week. Currently away from home.

No rush, take your own time.

@mosullivan93
Contributor Author

This is taking a little longer than I expected. I'll pick this up again over the weekend and add partial sorting for the _Float16 type, too. Apologies for the delay.

@mosullivan93
Contributor Author

mosullivan93 commented Apr 2, 2023

I've finished rebasing onto the latest changes. I reorganised some of the code I added to better fit the new layout (avx512-common-qsort.h) but haven't changed any of the logic otherwise.

Edit: _Float16 functions are now implemented in the below commit. The tests for partial k are very slow, especially for the emulated AVX512FP16. A simpler test may be sufficient; they currently test k = 1..n for n up to 1024.
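
One possible shape for a cheaper test, sketched under the assumption that checking a handful of k values per array size is acceptable coverage. The helpers first_k_sorted and sampled_k are hypothetical and not taken from this PR's test suite.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if the first k elements of `partial` equal the first k
// elements of a fully sorted copy of `original`.
inline bool first_k_sorted(const std::vector<int>& partial,
                           std::vector<int> original, std::size_t k)
{
    assert(partial.size() == original.size() && k <= original.size());
    std::sort(original.begin(), original.end());
    return std::equal(partial.begin(), partial.begin() + k, original.begin());
}

// A handful of k values for an array of size n: powers of two below n,
// plus n itself, instead of every k in 1..n.
inline std::vector<std::size_t> sampled_k(std::size_t n)
{
    std::vector<std::size_t> ks;
    for (std::size_t k = 1; k < n; k *= 2) ks.push_back(k);
    if (n > 0) ks.push_back(n);
    return ks;
}
```

For n = 1024 this checks 11 values of k rather than 1024, which trades exhaustive coverage for run time.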

@r-devulap
Member

@mosullivan93 Do I mark this as ready for review or is it still WIP?

Each datatype now supports two partial sorting algorithms:
1) Sort such that a particular index is valid (QuickSelect), and
2) Sort such that the first k indices are valid (PartialQuickSort),
where 'valid' means that the elements are in the same position as if the
entire array had been sorted.

Additionally transferred a few lingering comments from a refactor
earlier in the project.
@mosullivan93 mosullivan93 marked this pull request as ready for review April 7, 2023 17:06
@mosullivan93
Contributor Author

mosullivan93 commented Apr 7, 2023

I ran the benchmarks using a VM in the cloud. It's usually a win for the AVX512 functions, but partial sort for double is one case where they compare poorly.

Processor Specifications
mosullivan@sprvm:~/x86-simd-sort$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  2
    Socket(s):           1
    Stepping:            8
    BogoMIPS:            5399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq
                         ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid
                         rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 arat avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
                         avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   96 KiB (2 instances)
  L1i:                   64 KiB (2 instances)
  L2:                    4 MiB (2 instances)
  L3:                    105 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-3
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
Benchmark Results
mosullivan@sprvm:~/x86-simd-sort$ builddir/benchexe
2023-04-07T17:17:24+00:00
Running builddir/benchexe
Run on (4 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 2048 KiB (x2)
  L3 Unified 107520 KiB (x1)
Load Average: 0.00, 0.01, 0.05
---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
avx512_qsort<float>/10000                   36419 ns        36400 ns        19196
avx512_qsort<float>/1000000               7749142 ns      7748919 ns           90
stdsort<float>/10000                       526191 ns       526209 ns         1317
stdsort<float>/1000000                   82711190 ns     82704719 ns            8
avx512_qsort<uint32_t>/10000                28995 ns        28965 ns        24315
avx512_qsort<uint32_t>/1000000            6876938 ns      6876420 ns          102
stdsort<uint32_t>/10000                    491308 ns       491315 ns         1428
stdsort<uint32_t>/1000000                75152076 ns     75150045 ns            9
avx512_qsort<int32_t>/10000                 28514 ns        28489 ns        24564
avx512_qsort<int32_t>/1000000             6909871 ns      6909331 ns          101
stdsort<int32_t>/10000                     491652 ns       491667 ns         1421
stdsort<int32_t>/1000000                 75098996 ns     75094472 ns            9
avx512_qsort<double>/10000                  56425 ns        56438 ns        12351
avx512_qsort<double>/1000000             14057493 ns     14057058 ns           50
stdsort<double>/10000                      551104 ns       551150 ns         1257
stdsort<double>/1000000                  82489930 ns     82487387 ns            8
avx512_qsort<uint64_t>/10000                70208 ns        70220 ns         9971
avx512_qsort<uint64_t>/1000000           15469083 ns     15469019 ns           45
stdsort<uint64_t>/10000                    492595 ns       492634 ns         1413
stdsort<uint64_t>/1000000                74305155 ns     74304009 ns            9
avx512_qsort<int64_t>/10000                 68945 ns        68956 ns        10134
avx512_qsort<int64_t>/1000000            15449459 ns     15449523 ns           45
stdsort<int64_t>/10000                     504893 ns       504941 ns         1372
stdsort<int64_t>/10000000               903496547 ns    903483031 ns            1
avx512_qsort<uint16_t>/10000                28235 ns        28233 ns        24855
avx512_qsort<uint16_t>/1000000            5537195 ns      5536932 ns          126
stdsort<uint16_t>/10000                    479453 ns       479494 ns         1459
stdsort<uint16_t>/1000000                66950805 ns     66951663 ns           10
avx512_qsort<int16_t>/10000                 28330 ns        28328 ns        24801
avx512_qsort<int16_t>/1000000             5594429 ns      5594154 ns          125
stdsort<int16_t>/10000                     513220 ns       513290 ns         1364
stdsort<int16_t>/10000000               710608213 ns    710562888 ns            1
avx512_qselect<float>/10000                  4135 ns         4108 ns       170726
avx512_qselect<float>/1000000              743200 ns       743158 ns          938
stdnthelement<float>/10000                  11474 ns        11451 ns        60980
stdnthelement<float>/1000000              8855424 ns      8854214 ns           79
avx512_qselect<uint32_t>/10000               3761 ns         3726 ns       187541
avx512_qselect<uint32_t>/1000000           496308 ns       496217 ns         1409
stdnthelement<uint32_t>/10000               55725 ns        55696 ns        12496
stdnthelement<uint32_t>/1000000           9100662 ns      9100702 ns           77
avx512_qselect<int32_t>/10000                3768 ns         3736 ns       187754
avx512_qselect<int32_t>/1000000            499718 ns       499696 ns         1391
stdnthelement<int32_t>/10000                56962 ns        56933 ns        12302
stdnthelement<int32_t>/1000000            9079168 ns      9078421 ns           77
avx512_qselect<double>/10000                 8332 ns         8337 ns        84178
avx512_qselect<double>/1000000            2141389 ns      2141445 ns          326
stdnthelement<double>/10000                  7915 ns         7917 ns        88235
stdnthelement<double>/1000000             4967374 ns      4966940 ns          141
avx512_qselect<uint64_t>/10000               9164 ns         9172 ns        79497
avx512_qselect<uint64_t>/1000000          1465192 ns      1465144 ns          477
stdnthelement<uint64_t>/10000               11017 ns        11022 ns        63207
stdnthelement<uint64_t>/1000000           2948274 ns      2948173 ns          237
avx512_qselect<int64_t>/10000                8958 ns         8966 ns        79522
avx512_qselect<int64_t>/1000000           1451397 ns      1451508 ns          481
stdnthelement<int64_t>/10000                11187 ns        11194 ns        62101
stdnthelement<int64_t>/10000000          67233379 ns     67230696 ns           10
avx512_qselect<uint16_t>/10000               3258 ns         3261 ns       214639
avx512_qselect<uint16_t>/1000000           347956 ns       347783 ns         2005
stdnthelement<uint16_t>/10000                9466 ns         9467 ns        74058
stdnthelement<uint16_t>/1000000           7548821 ns      7548371 ns           93
avx512_qselect<int16_t>/10000                3377 ns         3377 ns       206972
avx512_qselect<int16_t>/1000000            358745 ns       358582 ns         1940
stdnthelement<int16_t>/10000                21092 ns        21095 ns        33449
stdnthelement<int16_t>/10000000          42250069 ns     42249112 ns           17
avx512_partial_qsort<float>/10000            4183 ns         4149 ns       168186
avx512_partial_qsort<float>/1000000        700434 ns       700399 ns         1005
stdpartialsort<float>/10000                  5921 ns         5888 ns       118828
stdpartialsort<float>/1000000              706571 ns       706512 ns          991
avx512_partial_qsort<uint32_t>/10000         3801 ns         3770 ns       185552
avx512_partial_qsort<uint32_t>/1000000     497738 ns       497645 ns         1396
stdpartialsort<uint32_t>/10000               7090 ns         7059 ns        98285
stdpartialsort<uint32_t>/1000000           572867 ns       572781 ns         1189
avx512_partial_qsort<int32_t>/10000          3797 ns         3763 ns       186396
avx512_partial_qsort<int32_t>/1000000      500951 ns       500845 ns         1394
stdpartialsort<int32_t>/10000                4121 ns         4089 ns       171082
stdpartialsort<int32_t>/1000000            488873 ns       488810 ns         1430
avx512_partial_qsort<double>/10000           8343 ns         8347 ns        83674
avx512_partial_qsort<double>/1000000      2156376 ns      2156418 ns          325
stdpartialsort<double>/10000                 5919 ns         5917 ns       118255
stdpartialsort<double>/1000000             708954 ns       708924 ns          987
avx512_partial_qsort<uint64_t>/10000         9203 ns         9211 ns        75823
avx512_partial_qsort<uint64_t>/1000000    1473890 ns      1474012 ns          477
stdpartialsort<uint64_t>/10000               7340 ns         7338 ns        95410
stdpartialsort<uint64_t>/1000000           828136 ns       828003 ns          842
avx512_partial_qsort<int64_t>/10000          8672 ns         8679 ns        76123
avx512_partial_qsort<int64_t>/1000000     1452034 ns      1451971 ns          482
stdpartialsort<int64_t>/10000                6926 ns         6924 ns        99165
stdpartialsort<int64_t>/10000000         10931774 ns     10931544 ns           64
avx512_partial_qsort<uint16_t>/10000         3312 ns         3312 ns       211509
avx512_partial_qsort<uint16_t>/1000000     348054 ns       347884 ns         2010
stdpartialsort<uint16_t>/10000               4000 ns         4002 ns       174916
stdpartialsort<uint16_t>/1000000           493094 ns       492933 ns         1423
avx512_partial_qsort<int16_t>/10000          3409 ns         3410 ns       204978
avx512_partial_qsort<int16_t>/1000000      359398 ns       359230 ns         1948
stdpartialsort<int16_t>/10000                4054 ns         4055 ns       172481
stdpartialsort<int16_t>/10000000          5976863 ns      5976280 ns          116
avx512_qsort<_Float16>/10000                37803 ns        37806 ns        19371
avx512_qsort<_Float16>/1000000            8490571 ns      8489842 ns           80
stdsort<_Float16>/10000                    546481 ns       546537 ns         1209
stdsort<_Float16>/1000000                63195434 ns     63196323 ns           11
avx512_qselect<_Float16>/10000               3398 ns         3399 ns       167466
avx512_qselect<_Float16>/1000000           429212 ns       429101 ns         1379
stdnthelement<_Float16>/10000                7483 ns         7485 ns       154007
stdnthelement<_Float16>/1000000           6389101 ns      6388882 ns          100
avx512_partial_qsort<_Float16>/10000         4569 ns         4571 ns       198538
avx512_partial_qsort<_Float16>/1000000     414681 ns       414553 ns         1558
stdpartialsort<_Float16>/10000               7169 ns         7171 ns        97345
stdpartialsort<_Float16>/1000000           751945 ns       751840 ns          949

Member

@r-devulap r-devulap left a comment

Excellent addition to the library. Thanks a ton for your work. The test and benchmark coverage is good too. LGTM apart from minor comments.


template <typename T>
inline void avx512_partial_qsort(T *arr, int64_t k, int64_t arrsize) {
avx512_qselect<T>(arr, k, arrsize);
Member

It might be better to interpret k as an index into the array rather than as the k-th element. That way the calls to avx512_qselect and nth_element look consistent.

avx512_qselect(arr, k, N);
std::nth_element(arr, arr + k, arr + N); 

Member

This comment was obviously meant for avx512_qselect. avx512_partial_qsort seems to be consistent with std::partial_sort.

Contributor Author

avx512_qselect now treats the parameter k as the index to align with std::nth_element. The avx512_partial_qsort function has been updated to reflect this change.
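
As a rough illustration of how a partial sort can be composed from a select step once k is an index, here is a scalar sketch with std::nth_element and std::sort standing in for the AVX512 routines. The function name partial_sort_via_select is hypothetical, and this is not claimed to be the exact code in the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sorts the first k elements of v into their final positions.
inline void partial_sort_via_select(std::vector<int>& v, std::size_t k)
{
    assert(k <= v.size());
    if (k == 0) return;
    // Place the element with 0-based index k - 1 and partition around it.
    std::nth_element(v.begin(), v.begin() + (k - 1), v.end());
    // Everything before index k - 1 is already <= v[k - 1], so sorting
    // that prefix completes the first k positions.
    std::sort(v.begin(), v.begin() + (k - 1));
}
```

The select step does the partitioning over the whole array, and the remaining sort only touches k - 1 elements, which is where the saving over a full sort comes from when k is small.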

@r-devulap
Member

r-devulap commented Apr 13, 2023

BTW, avx512_partial_qsort doesn't fare well against std::partial_sort for tiny values of k (relative to array size). Here are the benchmark numbers for k = 1000.

---------------------------------------------------------------------------------
Benchmark                                       Time             CPU   Iterations
---------------------------------------------------------------------------------
avx512_partial_qsort<float>/10000            9788 ns         9796 ns        71442
avx512_partial_qsort<float>/1000000       1094838 ns      1094518 ns          642
stdpartialsort<float>/10000                242822 ns       242808 ns         2881
stdpartialsort<float>/1000000             1211880 ns      1211480 ns          578
avx512_partial_qsort<uint32_t>/10000         8996 ns         9002 ns        78752
avx512_partial_qsort<uint32_t>/1000000     891157 ns       890889 ns          795
stdpartialsort<uint32_t>/10000             229334 ns       229334 ns         3051
stdpartialsort<uint32_t>/1000000          1276025 ns      1275644 ns          549
avx512_partial_qsort<int32_t>/10000          9043 ns         9050 ns        77891
avx512_partial_qsort<int32_t>/1000000      893288 ns       893044 ns          791
stdpartialsort<int32_t>/10000              233705 ns       233688 ns         2995
stdpartialsort<int32_t>/1000000           1122401 ns      1122031 ns          625
avx512_partial_qsort<double>/10000          16554 ns        16561 ns        42076
avx512_partial_qsort<double>/1000000      3168667 ns      3168046 ns          222
stdpartialsort<double>/10000               258611 ns       258593 ns         2707
stdpartialsort<double>/1000000            1543721 ns      1543545 ns          453
avx512_partial_qsort<uint64_t>/10000        18125 ns        18131 ns        38195
avx512_partial_qsort<uint64_t>/1000000    2200902 ns      2200552 ns          319
stdpartialsort<uint64_t>/10000             238763 ns       238747 ns         2933
stdpartialsort<uint64_t>/1000000          2004298 ns      2004187 ns          350
avx512_partial_qsort<int64_t>/10000         18295 ns        18303 ns        38023
avx512_partial_qsort<int64_t>/1000000     2214439 ns      2214266 ns          316
stdpartialsort<int64_t>/10000              238458 ns       238436 ns         2936
stdpartialsort<int64_t>/10000000         10328611 ns     10327139 ns           65

I am not sure if a typical use case of partial sort is small or large values of 'k', but we could make a note of this in the release notes.

@r-devulap
Member

Let's modify our benchmarks to reflect this. Instead of benchmarking for different array sizes, let's benchmark for different k values on a fixed array.

arr_bkp = arr;

/* Choose random index to sort up until */
int k = get_uniform_rand_array<int64_t>(1, ARRSIZE, 1).front();
Member

Let's modify this benchmark to take k as an argument. We could benchmark with the array size fixed at 10000 and various values of k = {10, 100, 1000, 5000}.
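
A rough sketch of what "k as an argument on a fixed array" could look like. It deliberately uses a plain std::chrono loop rather than Google Benchmark so that it stays self-contained, and it times std::partial_sort only; the function name time_partial_sort is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Times std::partial_sort on one fixed random array for each requested k.
// Returns (k, nanoseconds) pairs from a single run per k.
inline std::vector<std::pair<std::size_t, long long>>
time_partial_sort(std::size_t arrsize, const std::vector<std::size_t>& ks)
{
    std::mt19937 gen(42);  // fixed seed so every k sees the same input
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> input(arrsize);
    for (float& x : input) x = dist(gen);

    std::vector<std::pair<std::size_t, long long>> results;
    for (std::size_t k : ks) {
        assert(k <= arrsize);
        // Restore the unsorted input outside the timed region.
        std::vector<float> work = input;
        auto start = std::chrono::steady_clock::now();
        std::partial_sort(work.begin(), work.begin() + k, work.end());
        auto stop = std::chrono::steady_clock::now();
        results.emplace_back(
            k, std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
    }
    return results;
}
```

Calling time_partial_sort(10000, {10, 100, 1000, 5000}) gives one timing per k. A real benchmark would repeat each measurement many times, which is what a benchmarking framework handles.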

Contributor Author

avx512_partial_qsort benchmarks updated to allow k to vary.

arr_bkp = arr;

/* Choose random index to make sorted */
int k = get_uniform_rand_array<int64_t>(1, ARRSIZE, 1).front();
Member

Ditto. Same as partial sort.

Contributor Author

avx512_qselect benchmarks updated to allow k to vary.

The QuickSelect internal method is now phrased such that the position to
be sorted is given as an offset (in the same way that left points to the
first element and right points to the last element). Similarly, the
avx512_qselect method also now uses this interpretation.
The comment and variable names appear misleading as the function
actually returns the position of the element immediately following the
last which is less than the pivot.
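
The return-value convention described in the commit message above can be illustrated with a scalar stand-in built on std::partition, which returns an iterator to the first element of the second group. The function name partition_index is hypothetical; the vectorized partition in the library is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Moves values smaller than `pivot` to the front and returns the index of
// the first element that is not smaller than the pivot. That index is also
// the number of elements smaller than the pivot.
inline std::size_t partition_index(std::vector<int>& v, int pivot)
{
    auto first_not_less = std::partition(
        v.begin(), v.end(), [pivot](int x) { return x < pivot; });
    return static_cast<std::size_t>(first_not_less - v.begin());
}
```

In other words, the returned position is one past the last element that is less than the pivot, not the position of that last element itself.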
@mosullivan93
Contributor Author

mosullivan93 commented Apr 16, 2023

The requested changes have been actioned as I understand them; please have another look when you can. As for the common values of k, I'm not sure either. I tend to keep it small (e.g. <100) in my particular use case. I'm not sure exactly how std::partial_sort is implemented or whether the algorithm used there could benefit from the use of SIMD instructions.
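
For context on that last question: a heap-based scheme is a common way to implement a partial sort, and to my knowledge it is the approach libstdc++ takes, though that is an assumption here rather than something checked in this thread. A minimal sketch of the idea, with the hypothetical name heap_partial_sort:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Heap-select followed by heap-sort: keep a max-heap of the k smallest
// elements seen so far, then sort that heap in place.
inline void heap_partial_sort(std::vector<int>& v, std::size_t k)
{
    assert(k <= v.size());
    if (k == 0) return;
    auto first = v.begin();
    auto middle = first + k;
    std::make_heap(first, middle);  // max-heap over the first k elements
    for (auto it = middle; it != v.end(); ++it) {
        if (*it < *first) {
            // *it belongs among the k smallest: evict the current maximum.
            std::pop_heap(first, middle);    // moves the maximum to middle - 1
            std::iter_swap(middle - 1, it);  // swap it out for the new value
            std::push_heap(first, middle);   // restore the heap property
        }
    }
    std::sort_heap(first, middle);  // ascending order in [first, middle)
}
```

The cost of this scheme is roughly O(n log k), which would be consistent with the std::partial_sort timings growing with k in the updated benchmarks. The data-dependent heap updates also look harder to vectorize than a partition pass, though that is a guess.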


Updated Benchmarks
------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
avx512_qselect<float>/10                  4826 ns         4798 ns       146411
avx512_qselect<float>/100                 4823 ns         4800 ns       145792
avx512_qselect<float>/1000                4957 ns         4934 ns       141742
avx512_qselect<float>/5000                4725 ns         4683 ns       150206
stdnthelement<float>/10                  11344 ns        11325 ns        62073
stdnthelement<float>/100                 11980 ns        11952 ns        59213
stdnthelement<float>/1000                18484 ns        18458 ns        38033
stdnthelement<float>/5000                65817 ns        65791 ns        10567
avx512_qselect<uint32_t>/10               3925 ns         3896 ns       179664
avx512_qselect<uint32_t>/100              3816 ns         3789 ns       184183
avx512_qselect<uint32_t>/1000             3729 ns         3700 ns       189637
avx512_qselect<uint32_t>/5000             3827 ns         3793 ns       184615
stdnthelement<uint32_t>/10               53282 ns        53259 ns        12784
stdnthelement<uint32_t>/100              54836 ns        54815 ns        12641
stdnthelement<uint32_t>/1000             50653 ns        50630 ns        13746
stdnthelement<uint32_t>/5000             47105 ns        47077 ns        14895
avx512_qselect<int32_t>/10                4036 ns         4009 ns       178904
avx512_qselect<int32_t>/100               4013 ns         3986 ns       176087
avx512_qselect<int32_t>/1000              3900 ns         3868 ns       180936
avx512_qselect<int32_t>/5000              3828 ns         3789 ns       184852
stdnthelement<int32_t>/10                53937 ns        53920 ns        13200
stdnthelement<int32_t>/100               55126 ns        55109 ns        12685
stdnthelement<int32_t>/1000              51060 ns        51035 ns        10000
stdnthelement<int32_t>/5000              46580 ns        46556 ns        14938
avx512_qselect<double>/10                 7966 ns         7963 ns        87699
avx512_qselect<double>/100                7898 ns         7897 ns        88154
avx512_qselect<double>/1000               7424 ns         7421 ns        93066
avx512_qselect<double>/5000               9345 ns         9345 ns        74878
stdnthelement<double>/10                  6894 ns         6895 ns       101406
stdnthelement<double>/100                 6993 ns         6993 ns        99198
stdnthelement<double>/1000               12097 ns        12099 ns        57440
stdnthelement<double>/5000               70638 ns        70641 ns         9825
avx512_qselect<uint64_t>/10               8894 ns         8894 ns        78266
avx512_qselect<uint64_t>/100              8892 ns         8889 ns        78657
avx512_qselect<uint64_t>/1000             9029 ns         9027 ns        77224
avx512_qselect<uint64_t>/5000             7593 ns         7592 ns        92604
stdnthelement<uint64_t>/10               11076 ns        11073 ns        60545
stdnthelement<uint64_t>/100              11184 ns        11184 ns        63401
stdnthelement<uint64_t>/1000             12457 ns        12458 ns        55911
stdnthelement<uint64_t>/5000             30543 ns        30543 ns        22881
avx512_qselect<int64_t>/10                8406 ns         8407 ns        83453
avx512_qselect<int64_t>/100               8385 ns         8385 ns        81437
avx512_qselect<int64_t>/1000              8552 ns         8552 ns        81340
avx512_qselect<int64_t>/5000              7110 ns         7112 ns        92947
stdnthelement<int64_t>/10                10924 ns        10924 ns        63187
stdnthelement<int64_t>/100               11084 ns        11084 ns        62825
stdnthelement<int64_t>/1000              12419 ns        12419 ns        56956
stdnthelement<int64_t>/5000              30264 ns        30266 ns        23448
avx512_qselect<uint16_t>/10               3331 ns         3333 ns       209914
avx512_qselect<uint16_t>/100              3379 ns         3381 ns       207433
avx512_qselect<uint16_t>/1000             3375 ns         3377 ns       207044
avx512_qselect<uint16_t>/5000             3656 ns         3658 ns       189764
stdnthelement<uint16_t>/10               25459 ns        25462 ns        27513
stdnthelement<uint16_t>/100              24612 ns        24615 ns        28277
stdnthelement<uint16_t>/1000             79429 ns        79436 ns         8652
stdnthelement<uint16_t>/5000             34849 ns        34855 ns        19986
avx512_qselect<int16_t>/10                3382 ns         3384 ns       206767
avx512_qselect<int16_t>/100               3422 ns         3424 ns       204251
avx512_qselect<int16_t>/1000              3428 ns         3430 ns       203527
avx512_qselect<int16_t>/5000              3701 ns         3702 ns       188839
stdnthelement<int16_t>/10                10218 ns        10221 ns        70233
stdnthelement<int16_t>/100                9995 ns         9997 ns        69719
stdnthelement<int16_t>/1000              52794 ns        52796 ns        13335
stdnthelement<int16_t>/5000              12993 ns        12995 ns        53974
avx512_partial_qsort<float>/10            4859 ns         4840 ns       144368
avx512_partial_qsort<float>/100           5006 ns         4982 ns       138161
avx512_partial_qsort<float>/1000          7785 ns         7767 ns        89891
avx512_partial_qsort<float>/5000         20731 ns        20684 ns        33869
stdpartialsort<float>/10                  6531 ns         6501 ns       107739
stdpartialsort<float>/100                11786 ns        11760 ns        59486
stdpartialsort<float>/1000              232772 ns       232762 ns         2996
stdpartialsort<float>/5000              718178 ns       718216 ns          972
avx512_partial_qsort<uint32_t>/10         3941 ns         3914 ns       178956
avx512_partial_qsort<uint32_t>/100        3978 ns         3952 ns       176919
avx512_partial_qsort<uint32_t>/1000       6309 ns         6282 ns       114493
avx512_partial_qsort<uint32_t>/5000      17858 ns        17827 ns        39103
stdpartialsort<uint32_t>/10               4866 ns         4829 ns       144865
stdpartialsort<uint32_t>/100             11088 ns        11064 ns        64543
stdpartialsort<uint32_t>/1000           211636 ns       211634 ns         3299
stdpartialsort<uint32_t>/5000           658142 ns       658145 ns         1054
avx512_partial_qsort<int32_t>/10          4028 ns         4001 ns       178300
avx512_partial_qsort<int32_t>/100         3981 ns         3958 ns       176678
avx512_partial_qsort<int32_t>/1000        6100 ns         6072 ns       115128
avx512_partial_qsort<int32_t>/5000       17720 ns        17686 ns        39487
stdpartialsort<int32_t>/10                7493 ns         7457 ns        95426
stdpartialsort<int32_t>/100              13588 ns        13560 ns        53199
stdpartialsort<int32_t>/1000            205491 ns       205482 ns         3366
stdpartialsort<int32_t>/5000            655074 ns       655093 ns         1068
avx512_partial_qsort<double>/10           8001 ns         8004 ns        86123
avx512_partial_qsort<double>/100          8158 ns         8161 ns        86008
avx512_partial_qsort<double>/1000        11790 ns        11793 ns        59446
avx512_partial_qsort<double>/5000        36172 ns        36176 ns        19376
stdpartialsort<double>/10                 6276 ns         6267 ns       112038
stdpartialsort<double>/100               12423 ns        12413 ns        56378
stdpartialsort<double>/1000             235425 ns       235432 ns         2986
stdpartialsort<double>/5000             752369 ns       752428 ns          914
avx512_partial_qsort<uint64_t>/10         8927 ns         8928 ns        78382
avx512_partial_qsort<uint64_t>/100        9296 ns         9296 ns        75426
avx512_partial_qsort<uint64_t>/1000      14933 ns        14930 ns        46910
avx512_partial_qsort<uint64_t>/5000      40249 ns        40246 ns        17404
stdpartialsort<uint64_t>/10               4858 ns         4848 ns       144347
stdpartialsort<uint64_t>/100             10989 ns        10979 ns        64277
stdpartialsort<uint64_t>/1000           222731 ns       222738 ns         3143
stdpartialsort<uint64_t>/5000           686252 ns       686297 ns         1015
avx512_partial_qsort<int64_t>/10          8571 ns         8572 ns        80526
avx512_partial_qsort<int64_t>/100         8801 ns         8800 ns        79405
avx512_partial_qsort<int64_t>/1000       15066 ns        15066 ns        46443
avx512_partial_qsort<int64_t>/5000       40297 ns        40300 ns        17340
stdpartialsort<int64_t>/10                7949 ns         7941 ns        88242
stdpartialsort<int64_t>/100              13630 ns        13619 ns        51707
stdpartialsort<int64_t>/1000            220432 ns       220439 ns         3167
stdpartialsort<int64_t>/5000            695909 ns       695951 ns         1002
avx512_partial_qsort<uint16_t>/10         3402 ns         3401 ns       205506
avx512_partial_qsort<uint16_t>/100        3510 ns         3511 ns       201042
avx512_partial_qsort<uint16_t>/1000       5784 ns         5784 ns       120913
avx512_partial_qsort<uint16_t>/5000      16443 ns        16447 ns        42612
stdpartialsort<uint16_t>/10               6846 ns         6848 ns       104173
stdpartialsort<uint16_t>/100             14026 ns        14025 ns        49777
stdpartialsort<uint16_t>/1000           206246 ns       206260 ns         3324
stdpartialsort<uint16_t>/5000           663026 ns       663080 ns         1049
avx512_partial_qsort<int16_t>/10          3423 ns         3425 ns       204374
avx512_partial_qsort<int16_t>/100         3527 ns         3529 ns       198203
avx512_partial_qsort<int16_t>/1000        5854 ns         5857 ns       118890
avx512_partial_qsort<int16_t>/5000       16560 ns        16563 ns        41834
stdpartialsort<int16_t>/10                4441 ns         4442 ns       157534
stdpartialsort<int16_t>/100              10760 ns        10761 ns        65449
stdpartialsort<int16_t>/1000            210106 ns       210114 ns         3293
stdpartialsort<int16_t>/5000            666538 ns       666600 ns         1042
avx512_qsort<_Float16>/10000             38231 ns        38232 ns        19263
avx512_qsort<_Float16>/1000000         8614182 ns      8613989 ns           79
stdsort<_Float16>/10000                 542981 ns       543025 ns         1237
stdsort<_Float16>/1000000             63344408 ns     63341515 ns           11
avx512_qselect<_Float16>/10               3540 ns         3541 ns       159730
avx512_qselect<_Float16>/100              4103 ns         4104 ns       194056
avx512_qselect<_Float16>/1000             4450 ns         4450 ns       186536
avx512_qselect<_Float16>/5000             4613 ns         4613 ns       184328
stdnthelement<_Float16>/10               43077 ns        43083 ns        79580
stdnthelement<_Float16>/100              51659 ns        51662 ns        10000
stdnthelement<_Float16>/1000             10474 ns        10476 ns        54754
stdnthelement<_Float16>/5000             62674 ns        62678 ns        10000
avx512_partial_qsort<_Float16>/10         3824 ns         3826 ns       161367
avx512_partial_qsort<_Float16>/100        4312 ns         4313 ns       177678
avx512_partial_qsort<_Float16>/1000       6892 ns         6895 ns        97355
avx512_partial_qsort<_Float16>/5000      22241 ns        22244 ns        32179
stdpartialsort<_Float16>/10               6513 ns         6515 ns       107400
stdpartialsort<_Float16>/100             11632 ns        11631 ns        62839
stdpartialsort<_Float16>/1000           259799 ns       259821 ns         2759
stdpartialsort<_Float16>/5000           746416 ns       746482 ns          929

@r-devulap
Member

avx512_partial_qsort<_Float16>/10         3824 ns         3826 ns       161367
avx512_partial_qsort<_Float16>/100        4312 ns         4313 ns       177678
avx512_partial_qsort<_Float16>/1000       6892 ns         6895 ns        97355
avx512_partial_qsort<_Float16>/5000      22241 ns        22244 ns        32179
stdpartialsort<_Float16>/10               6513 ns         6515 ns       107400
stdpartialsort<_Float16>/100             11632 ns        11631 ns        62839
stdpartialsort<_Float16>/1000           259799 ns       259821 ns         2759
stdpartialsort<_Float16>/5000           746416 ns       746482 ns          929

Was this run on Intel Sapphire Rapids?

@WilliamTambellini
Contributor

Very cool @mosullivan93
Would you like me to try your PR on an Intel(R) Xeon(R) Platinum 8375C (AWS C6i)?
Note: for deep learning use cases, k is even smaller (say 4) and n is bigger, at least 100k (e.g. the vocab of GPT is about 250k, IIRC). I would advise adjusting the benchmark to cover such use cases.
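To make the small-k, large-n case concrete, here is a minimal standard-library sketch of selecting the k smallest values from a large array. The helpers `k_smallest` and `random_floats` are hypothetical names invented for this illustration and are not part of x86-simd-sort; `std::partial_sort` stands in for `avx512_partial_qsort`, which the PR description says mirrors it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of x86-simd-sort): returns the k smallest
// values of `data` in ascending order, using only the standard library.
std::vector<float> k_smallest(std::vector<float> data, std::size_t k)
{
    k = std::min(k, data.size());
    // Only the first k positions are fully sorted; the order of the
    // remaining elements is unspecified.
    std::partial_sort(data.begin(),
                      data.begin() + static_cast<std::ptrdiff_t>(k),
                      data.end());
    data.resize(k);
    return data;
}

// Hypothetical helper: n pseudo-random floats in [0, 1) from a fixed seed,
// so repeated runs see the same input.
std::vector<float> random_floats(std::size_t n, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> v(n);
    for (auto &x : v) {
        x = dist(gen);
    }
    return v;
}
```

Calling `k_smallest(random_floats(250000), 4)` exercises roughly the shape described above (k = 4, n = 250k). A top-k over logits would want the largest values instead, which this sketch does not cover.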

@mosullivan93
Contributor Author

mosullivan93 commented Apr 18, 2023

@r-devulap: Yea, I'm running these benchmarks on one of the C3 preview VMs on Google Cloud (Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz).

@WilliamTambellini: Thanks for the offer and for sharing your insight on typical use cases. I don't want you to incur any costs for testing. The 8481C VMs are currently free to use while in public preview.

@both
I've done some custom benchmarking on another branch (mosullivan93/x86-simd-sort@b4bb179) to test a smaller value of k and larger array sizes. Let me know if these (or a subset of them) should be added to the PR.
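For readers without the branch checked out, the idea of sweeping both k and the array size can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the linked commit: `SelectTiming` and `time_nth_element` are hypothetical names, `std::nth_element` stands in for `avx512_qselect`, and the `/k/n` reading of the benchmark names in the results is inferred from the values shown rather than confirmed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical result type for this sketch.
struct SelectTiming {
    double mean_ns;   // mean wall-clock time per std::nth_element call
    bool partitioned; // true if the last run left the array partitioned around index k
};

// Hypothetical helper: times std::nth_element for one (k, n) pair.
// Returns {0.0, false} for invalid arguments (k >= n or reps <= 0).
SelectTiming time_nth_element(std::size_t k, std::size_t n, int reps = 5)
{
    if (k >= n || reps <= 0) {
        return {0.0, false};
    }
    std::mt19937_64 gen(12345); // fixed seed for repeatable input
    std::uniform_int_distribution<std::int64_t> dist;
    std::vector<std::int64_t> input(n);
    for (auto &x : input) {
        x = dist(gen);
    }

    std::vector<std::int64_t> work;
    double total_ns = 0.0;
    const std::ptrdiff_t kk = static_cast<std::ptrdiff_t>(k);
    for (int r = 0; r < reps; ++r) {
        work = input; // restore the unsorted data outside the timed region
        const auto t0 = std::chrono::steady_clock::now();
        std::nth_element(work.begin(), work.begin() + kk, work.end());
        const auto t1 = std::chrono::steady_clock::now();
        total_ns += std::chrono::duration<double, std::nano>(t1 - t0).count();
    }

    // Postcondition of nth_element: nothing before index k is greater than
    // work[k], and nothing after it is smaller.
    const std::int64_t kth = work[k];
    const bool ok
            = std::all_of(work.begin(), work.begin() + kk,
                          [kth](std::int64_t x) { return x <= kth; })
            && std::all_of(work.begin() + kk, work.end(),
                           [kth](std::int64_t x) { return x >= kth; });
    return {total_ns / reps, ok};
}
```

Looping this helper over k in {5, 10, 100, 1000, 5000} and n in {10000, 100000, 250000} would reproduce the grid shape of the results, though a hand-rolled timer like this is far less careful than Google Benchmark about warm-up and iteration counts.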


Processor Specifications
mosullivan@sprvm-1:~/x86-simd-sort$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8481C CPU @ 2.70GHz
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            8
    BogoMIPS:            5399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni
                         pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep
                         bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 arat avx512vbmi umip avx512_vbmi2 gfni vaes
                         vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   192 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    8 MiB (4 instances)
  L3:                    105 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected
Benchmark Results
mosullivan@sprvm-1:~/x86-simd-sort$ builddir/benchexe
2023-04-18T04:54:58+00:00
Running builddir/benchexe
Run on (8 X 2700 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 2048 KiB (x4)
  L3 Unified 107520 KiB (x1)
Load Average: 0.20, 0.07, 0.10
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
avx512_qsort<float>/10000                       36546 ns        36530 ns        19115
avx512_qsort<float>/1000000                   7546703 ns      7546512 ns           92
stdsort<float>/10000                           531294 ns       531322 ns         1282
stdsort<float>/1000000                       81606797 ns     81606082 ns            9
avx512_qsort<uint32_t>/10000                    29939 ns        29927 ns        23116
avx512_qsort<uint32_t>/1000000                6809980 ns      6809485 ns          103
stdsort<uint32_t>/10000                        488850 ns       488866 ns         1431
stdsort<uint32_t>/1000000                    75064355 ns     75061505 ns            9
avx512_qsort<int32_t>/10000                     29839 ns        29825 ns        24086
avx512_qsort<int32_t>/1000000                 6762761 ns      6762677 ns          103
stdsort<int32_t>/10000                         487808 ns       487825 ns         1428
stdsort<int32_t>/1000000                     75157822 ns     75155483 ns            9
avx512_qsort<double>/10000                      52216 ns        52227 ns        13117
avx512_qsort<double>/1000000                 14094736 ns     14094611 ns           50
stdsort<double>/10000                          542306 ns       542332 ns         1269
stdsort<double>/1000000                      81577498 ns     81576251 ns            9
avx512_qsort<uint64_t>/10000                    68711 ns        68722 ns        10158
avx512_qsort<uint64_t>/1000000               15656858 ns     15656527 ns           45
stdsort<uint64_t>/10000                        490750 ns       490784 ns         1433
stdsort<uint64_t>/1000000                    73993037 ns     73991820 ns            9
avx512_qsort<int64_t>/10000                     68152 ns        68164 ns        10271
avx512_qsort<int64_t>/1000000                15421770 ns     15421205 ns           45
stdsort<int64_t>/10000                         494573 ns       494593 ns         1422
stdsort<int64_t>/1000000                     74343555 ns     74341164 ns            9
avx512_qsort<uint16_t>/10000                    28291 ns        28291 ns        24718
avx512_qsort<uint16_t>/1000000                5624938 ns      5624785 ns          124
stdsort<uint16_t>/10000                        483085 ns       483110 ns         1446
stdsort<uint16_t>/1000000                    67343601 ns     67343311 ns           10
avx512_qsort<int16_t>/10000                     27931 ns        27930 ns        24700
avx512_qsort<int16_t>/1000000                 5562881 ns      5562897 ns          126
stdsort<int16_t>/10000                         478798 ns       478834 ns         1469
stdsort<int16_t>/1000000                     66807871 ns     66806451 ns           11
avx512_qselect<float>/5/10000                    4760 ns         4744 ns       148107
avx512_qselect<float>/10/10000                   4758 ns         4746 ns       146031
avx512_qselect<float>/100/10000                  4778 ns         4759 ns       147070
avx512_qselect<float>/1000/10000                 4889 ns         4871 ns       144082
avx512_qselect<float>/5000/10000                 4836 ns         4814 ns       145198
avx512_qselect<float>/5/100000                  43699 ns        43701 ns        16140
avx512_qselect<float>/10/100000                 44696 ns        44693 ns        15694
avx512_qselect<float>/100/100000                44161 ns        44158 ns        15612
avx512_qselect<float>/1000/100000               44684 ns        44685 ns        15648
avx512_qselect<float>/5000/100000               43796 ns        43793 ns        16093
avx512_qselect<float>/5/250000                 154549 ns       154518 ns         4421
avx512_qselect<float>/10/250000                155419 ns       155384 ns         4551
avx512_qselect<float>/100/250000               159639 ns       159606 ns         4482
avx512_qselect<float>/1000/250000              155274 ns       155235 ns         4511
avx512_qselect<float>/5000/250000              153665 ns       153619 ns         4370
stdnthelement<float>/5/10000                     8614 ns         8597 ns        82675
stdnthelement<float>/10/10000                    8754 ns         8737 ns        80734
stdnthelement<float>/100/10000                   8909 ns         8892 ns        77472
stdnthelement<float>/1000/10000                 24274 ns        24259 ns        28627
stdnthelement<float>/5000/10000                 72891 ns        72879 ns         9537
stdnthelement<float>/5/100000                  868493 ns       868550 ns          809
stdnthelement<float>/10/100000                 868814 ns       868856 ns          809
stdnthelement<float>/100/100000                873049 ns       873096 ns          805
stdnthelement<float>/1000/100000               874254 ns       874323 ns          790
stdnthelement<float>/5000/100000               904936 ns       905010 ns          770
stdnthelement<float>/5/250000                 2348588 ns      2348609 ns          298
stdnthelement<float>/10/250000                2348372 ns      2348441 ns          298
stdnthelement<float>/100/250000               2350980 ns      2350998 ns          298
stdnthelement<float>/1000/250000              2357860 ns      2357891 ns          297
stdnthelement<float>/5000/250000              2441402 ns      2441432 ns          287
avx512_qselect<uint32_t>/5/10000                 4122 ns         4102 ns       170670
avx512_qselect<uint32_t>/10/10000                4256 ns         4232 ns       155745
avx512_qselect<uint32_t>/100/10000               4042 ns         4023 ns       174043
avx512_qselect<uint32_t>/1000/10000              3937 ns         3914 ns       178785
avx512_qselect<uint32_t>/5000/10000              4235 ns         4211 ns       167972
avx512_qselect<uint32_t>/5/100000               57001 ns        57002 ns        12134
avx512_qselect<uint32_t>/10/100000              57956 ns        57958 ns        12192
avx512_qselect<uint32_t>/100/100000             58324 ns        58325 ns        12095
avx512_qselect<uint32_t>/1000/100000            57454 ns        57456 ns        12119
avx512_qselect<uint32_t>/5000/100000            57299 ns        57301 ns        12153
avx512_qselect<uint32_t>/5/250000              179469 ns       179428 ns         3865
avx512_qselect<uint32_t>/10/250000             180629 ns       180596 ns         3868
avx512_qselect<uint32_t>/100/250000            180989 ns       180957 ns         3868
avx512_qselect<uint32_t>/1000/250000           181152 ns       181113 ns         3860
avx512_qselect<uint32_t>/5000/250000           180309 ns       180269 ns         3864
stdnthelement<uint32_t>/5/10000                 38373 ns        38356 ns        18232
stdnthelement<uint32_t>/10/10000                38442 ns        38425 ns        17816
stdnthelement<uint32_t>/100/10000               45596 ns        45572 ns        16764
stdnthelement<uint32_t>/1000/10000              34181 ns        34166 ns        20988
stdnthelement<uint32_t>/5000/10000              27353 ns        27335 ns        26186
stdnthelement<uint32_t>/5/100000               795411 ns       795460 ns          883
stdnthelement<uint32_t>/10/100000              795419 ns       795471 ns          877
stdnthelement<uint32_t>/100/100000             806820 ns       806849 ns          859
stdnthelement<uint32_t>/1000/100000            802066 ns       802128 ns          867
stdnthelement<uint32_t>/5000/100000            844878 ns       844911 ns          827
stdnthelement<uint32_t>/5/250000              1975763 ns      1975816 ns          355
stdnthelement<uint32_t>/10/250000             1974138 ns      1974192 ns          354
stdnthelement<uint32_t>/100/250000            1973533 ns      1973561 ns          355
stdnthelement<uint32_t>/1000/250000           1984801 ns      1984855 ns          353
stdnthelement<uint32_t>/5000/250000           1993102 ns      1993160 ns          351
avx512_qselect<int32_t>/5/10000                  4131 ns         4110 ns       170537
avx512_qselect<int32_t>/10/10000                 4154 ns         4134 ns       170468
avx512_qselect<int32_t>/100/10000                4079 ns         4060 ns       173242
avx512_qselect<int32_t>/1000/10000               3954 ns         3934 ns       177973
avx512_qselect<int32_t>/5000/10000               4235 ns         4210 ns       166616
avx512_qselect<int32_t>/5/100000                35184 ns        35182 ns        19982
avx512_qselect<int32_t>/10/100000               35684 ns        35684 ns        19915
avx512_qselect<int32_t>/100/100000              35291 ns        35291 ns        19656
avx512_qselect<int32_t>/1000/100000             35161 ns        35159 ns        20202
avx512_qselect<int32_t>/5000/100000             35215 ns        35214 ns        20242
avx512_qselect<int32_t>/5/250000               160036 ns       159984 ns         4425
avx512_qselect<int32_t>/10/250000              157617 ns       157571 ns         4437
avx512_qselect<int32_t>/100/250000             156748 ns       156705 ns         4429
avx512_qselect<int32_t>/1000/250000            158496 ns       158451 ns         4405
avx512_qselect<int32_t>/5000/250000            156434 ns       156390 ns         4463
stdnthelement<int32_t>/5/10000                  35183 ns        35166 ns        20205
stdnthelement<int32_t>/10/10000                 34480 ns        34463 ns        20331
stdnthelement<int32_t>/100/10000                36868 ns        36850 ns        18756
stdnthelement<int32_t>/1000/10000               30665 ns        30647 ns        22556
stdnthelement<int32_t>/5000/10000               26535 ns        26516 ns        25677
stdnthelement<int32_t>/5/100000                797259 ns       797313 ns          876
stdnthelement<int32_t>/10/100000               801230 ns       801285 ns          869
stdnthelement<int32_t>/100/100000              815925 ns       815962 ns          855
stdnthelement<int32_t>/1000/100000             808539 ns       808577 ns          860
stdnthelement<int32_t>/5000/100000             845682 ns       845714 ns          827
stdnthelement<int32_t>/5/250000               1987003 ns      1987030 ns          353
stdnthelement<int32_t>/10/250000              1984676 ns      1984736 ns          352
stdnthelement<int32_t>/100/250000             1986682 ns      1986749 ns          353
stdnthelement<int32_t>/1000/250000            1997176 ns      1997161 ns          351
stdnthelement<int32_t>/5000/250000            2003845 ns      2003829 ns          349
avx512_qselect<double>/5/10000                   7158 ns         7164 ns        99749
avx512_qselect<double>/10/10000                  7063 ns         7070 ns        97711
avx512_qselect<double>/100/10000                 7044 ns         7051 ns        99990
avx512_qselect<double>/1000/10000                6614 ns         6620 ns       106233
avx512_qselect<double>/5000/10000                7967 ns         7973 ns        88231
avx512_qselect<double>/5/100000                150307 ns       150284 ns         4656
avx512_qselect<double>/10/100000               155114 ns       155096 ns         4409
avx512_qselect<double>/100/100000              153300 ns       153282 ns         4580
avx512_qselect<double>/1000/100000             154199 ns       154178 ns         4447
avx512_qselect<double>/5000/100000             157782 ns       157761 ns         4419
avx512_qselect<double>/5/250000                400756 ns       400676 ns         1734
avx512_qselect<double>/10/250000               414643 ns       414544 ns         1743
avx512_qselect<double>/100/250000              399819 ns       399728 ns         1759
avx512_qselect<double>/1000/250000             400285 ns       400206 ns         1758
avx512_qselect<double>/5000/250000             400919 ns       400833 ns         1690
stdnthelement<double>/5/10000                    7502 ns         7508 ns        97509
stdnthelement<double>/10/10000                   7680 ns         7685 ns        91079
stdnthelement<double>/100/10000                  7644 ns         7650 ns        89805
stdnthelement<double>/1000/10000                10087 ns        10094 ns        67998
stdnthelement<double>/5000/10000                61704 ns        61713 ns        11295
stdnthelement<double>/5/100000                 362566 ns       362563 ns         1905
stdnthelement<double>/10/100000                364133 ns       364119 ns         1933
stdnthelement<double>/100/100000               366645 ns       366650 ns         1921
stdnthelement<double>/1000/100000              360523 ns       360513 ns         1909
stdnthelement<double>/5000/100000              374835 ns       374830 ns         1863
stdnthelement<double>/5/250000                1813412 ns      1813415 ns          384
stdnthelement<double>/10/250000               1817529 ns      1817556 ns          384
stdnthelement<double>/100/250000              1812975 ns      1812867 ns          386
stdnthelement<double>/1000/250000             1837840 ns      1837856 ns          380
stdnthelement<double>/5000/250000             1936094 ns      1936079 ns          359
avx512_qselect<uint64_t>/5/10000                 8797 ns         8803 ns        79678
avx512_qselect<uint64_t>/10/10000                8808 ns         8814 ns        79647
avx512_qselect<uint64_t>/100/10000               8788 ns         8794 ns        79260
avx512_qselect<uint64_t>/1000/10000              8989 ns         8998 ns        77961
avx512_qselect<uint64_t>/5000/10000              7300 ns         7299 ns        97003
avx512_qselect<uint64_t>/5/100000              110040 ns       110010 ns         5481
avx512_qselect<uint64_t>/10/100000             118674 ns       118649 ns         5667
avx512_qselect<uint64_t>/100/100000            119758 ns       119729 ns         6020
avx512_qselect<uint64_t>/1000/100000           120264 ns       120237 ns         5623
avx512_qselect<uint64_t>/5000/100000           122318 ns       122287 ns         5885
avx512_qselect<uint64_t>/5/250000              354518 ns       354391 ns         1953
avx512_qselect<uint64_t>/10/250000             343227 ns       343086 ns         2027
avx512_qselect<uint64_t>/100/250000            350416 ns       350298 ns         2029
avx512_qselect<uint64_t>/1000/250000           347756 ns       347629 ns         1994
avx512_qselect<uint64_t>/5000/250000           340102 ns       339988 ns         2001
stdnthelement<uint64_t>/5/10000                  9097 ns         9104 ns        76801
stdnthelement<uint64_t>/10/10000                 9123 ns         9130 ns        76653
stdnthelement<uint64_t>/100/10000                9188 ns         9194 ns        75531
stdnthelement<uint64_t>/1000/10000              10485 ns        10488 ns        66467
stdnthelement<uint64_t>/5000/10000              41109 ns        41119 ns        16682
stdnthelement<uint64_t>/5/100000               851451 ns       851482 ns          826
stdnthelement<uint64_t>/10/100000              849354 ns       849399 ns          817
stdnthelement<uint64_t>/100/100000             856813 ns       856858 ns          814
stdnthelement<uint64_t>/1000/100000            860449 ns       860487 ns          812
stdnthelement<uint64_t>/5000/100000            866994 ns       867049 ns          811
stdnthelement<uint64_t>/5/250000              2205035 ns      2205053 ns          317
stdnthelement<uint64_t>/10/250000             2206156 ns      2206100 ns          318
stdnthelement<uint64_t>/100/250000            2202326 ns      2202319 ns          318
stdnthelement<uint64_t>/1000/250000           2204795 ns      2204786 ns          316
stdnthelement<uint64_t>/5000/250000           2257709 ns      2257653 ns          310
avx512_qselect<int64_t>/5/10000                  8821 ns         8825 ns        79479
avx512_qselect<int64_t>/10/10000                 8897 ns         8902 ns        79443
avx512_qselect<int64_t>/100/10000                8837 ns         8844 ns        79301
avx512_qselect<int64_t>/1000/10000               9009 ns         9016 ns        77495
avx512_qselect<int64_t>/5000/10000               7136 ns         7141 ns        97844
avx512_qselect<int64_t>/5/100000               115031 ns       115004 ns         6015
avx512_qselect<int64_t>/10/100000              115371 ns       115342 ns         6098
avx512_qselect<int64_t>/100/100000             114198 ns       114169 ns         6159
avx512_qselect<int64_t>/1000/100000            114238 ns       114211 ns         6174
avx512_qselect<int64_t>/5000/100000            114694 ns       114664 ns         6177
avx512_qselect<int64_t>/5/250000               352084 ns       351983 ns         2018
avx512_qselect<int64_t>/10/250000              346159 ns       346027 ns         2001
avx512_qselect<int64_t>/100/250000             346043 ns       345909 ns         2017
avx512_qselect<int64_t>/1000/250000            347265 ns       347150 ns         2029
avx512_qselect<int64_t>/5000/250000            345358 ns       345233 ns         2018
stdnthelement<int64_t>/5/10000                  11248 ns        11254 ns        62499
stdnthelement<int64_t>/10/10000                 11226 ns        11233 ns        62057
stdnthelement<int64_t>/100/10000                11301 ns        11308 ns        61748
stdnthelement<int64_t>/1000/10000               12417 ns        12421 ns        56311
stdnthelement<int64_t>/5000/10000               41299 ns        41307 ns        16984
stdnthelement<int64_t>/5/100000                856816 ns       856847 ns          811
stdnthelement<int64_t>/10/100000               855227 ns       855252 ns          812
stdnthelement<int64_t>/100/100000              860638 ns       860676 ns          809
stdnthelement<int64_t>/1000/100000             872162 ns       872223 ns          806
stdnthelement<int64_t>/5000/100000             871810 ns       871833 ns          797
stdnthelement<int64_t>/5/250000               2214095 ns      2214079 ns          316
stdnthelement<int64_t>/10/250000              2210080 ns      2210061 ns          316
stdnthelement<int64_t>/100/250000             2212167 ns      2212194 ns          316
stdnthelement<int64_t>/1000/250000            2214614 ns      2214577 ns          315
stdnthelement<int64_t>/5000/250000            2267539 ns      2267524 ns          309
avx512_qselect<uint16_t>/5/10000                 3372 ns         3376 ns       207020
avx512_qselect<uint16_t>/10/10000                3385 ns         3389 ns       206961
avx512_qselect<uint16_t>/100/10000               3459 ns         3456 ns       202193
avx512_qselect<uint16_t>/1000/10000              3466 ns         3467 ns       201412
avx512_qselect<uint16_t>/5000/10000              3721 ns         3723 ns       188104
avx512_qselect<uint16_t>/5/100000               24645 ns        24616 ns        28457
avx512_qselect<uint16_t>/10/100000              24590 ns        24560 ns        28363
avx512_qselect<uint16_t>/100/100000             24663 ns        24634 ns        28458
avx512_qselect<uint16_t>/1000/100000            24844 ns        24815 ns        28082
avx512_qselect<uint16_t>/5000/100000            24227 ns        24199 ns        28943
avx512_qselect<uint16_t>/5/250000               58883 ns        58877 ns        11797
avx512_qselect<uint16_t>/10/250000              58756 ns        58749 ns        11894
avx512_qselect<uint16_t>/100/250000             58909 ns        58901 ns        11883
avx512_qselect<uint16_t>/1000/250000            58698 ns        58692 ns        11903
avx512_qselect<uint16_t>/5000/250000            59153 ns        59145 ns        11852
stdnthelement<uint16_t>/5/10000                 23609 ns        23612 ns        29655
stdnthelement<uint16_t>/10/10000                26104 ns        26106 ns        27706
stdnthelement<uint16_t>/100/10000               27315 ns        27293 ns        25917
stdnthelement<uint16_t>/1000/10000              87661 ns        87668 ns         8785
stdnthelement<uint16_t>/5000/10000              38491 ns        38388 ns        17887
stdnthelement<uint16_t>/5/100000               535225 ns       535235 ns         1315
stdnthelement<uint16_t>/10/100000              535055 ns       535064 ns         1305
stdnthelement<uint16_t>/100/100000             544586 ns       544597 ns         1282
stdnthelement<uint16_t>/1000/100000            549416 ns       549419 ns         1272
stdnthelement<uint16_t>/5000/100000            560910 ns       560919 ns         1252
stdnthelement<uint16_t>/5/250000              2407438 ns      2407459 ns          291
stdnthelement<uint16_t>/10/250000             2404796 ns      2404851 ns          291
stdnthelement<uint16_t>/100/250000            2410891 ns      2410921 ns          290
stdnthelement<uint16_t>/1000/250000           2412071 ns      2412145 ns          290
stdnthelement<uint16_t>/5000/250000           2401782 ns      2401791 ns          291
avx512_qselect<int16_t>/5/10000                  3331 ns         3333 ns       208987
avx512_qselect<int16_t>/10/10000                 3330 ns         3331 ns       209751
avx512_qselect<int16_t>/100/10000                3373 ns         3374 ns       207397
avx512_qselect<int16_t>/1000/10000               3401 ns         3402 ns       205778
avx512_qselect<int16_t>/5000/10000               3648 ns         3648 ns       192046
avx512_qselect<int16_t>/5/100000                23815 ns        23786 ns        29339
avx512_qselect<int16_t>/10/100000               23822 ns        23791 ns        29367
avx512_qselect<int16_t>/100/100000              23824 ns        23795 ns        29383
avx512_qselect<int16_t>/1000/100000             23689 ns        23662 ns        29471
avx512_qselect<int16_t>/5000/100000             23482 ns        23455 ns        29906
avx512_qselect<int16_t>/5/250000                57127 ns        57119 ns        12250
avx512_qselect<int16_t>/10/250000               57288 ns        57250 ns        12214
avx512_qselect<int16_t>/100/250000              57255 ns        57247 ns        12234
avx512_qselect<int16_t>/1000/250000             57164 ns        57156 ns        12266
avx512_qselect<int16_t>/5000/250000             56885 ns        56877 ns        12289
stdnthelement<int16_t>/5/10000                  22824 ns        22827 ns        30759
stdnthelement<int16_t>/10/10000                 24650 ns        24653 ns        28026
stdnthelement<int16_t>/100/10000                23783 ns        23786 ns        29397
stdnthelement<int16_t>/1000/10000               81471 ns        81476 ns         8486
stdnthelement<int16_t>/5000/10000               35619 ns        35623 ns        19679
stdnthelement<int16_t>/5/100000                542789 ns       542779 ns         1283
stdnthelement<int16_t>/10/100000               574841 ns       574861 ns         1293
stdnthelement<int16_t>/100/100000              567441 ns       567454 ns         1096
stdnthelement<int16_t>/1000/100000             556303 ns       556319 ns         1258
stdnthelement<int16_t>/5000/100000             567813 ns       567848 ns         1232
stdnthelement<int16_t>/5/250000               2427380 ns      2427392 ns          288
stdnthelement<int16_t>/10/250000              2427029 ns      2427091 ns          288
stdnthelement<int16_t>/100/250000             2433108 ns      2433057 ns          287
stdnthelement<int16_t>/1000/250000            2437414 ns      2437473 ns          288
stdnthelement<int16_t>/5000/250000            2425573 ns      2425601 ns          289
avx512_partial_qsort<float>/5/10000              4784 ns         4764 ns       146617
avx512_partial_qsort<float>/10/10000             4781 ns         4762 ns       146709
avx512_partial_qsort<float>/100/10000            4921 ns         4905 ns       142704
avx512_partial_qsort<float>/1000/10000           7804 ns         7789 ns        89779
avx512_partial_qsort<float>/5000/10000          20664 ns        20643 ns        33918
avx512_partial_qsort<float>/5/100000            43090 ns        43092 ns        16074
avx512_partial_qsort<float>/10/100000           43690 ns        43693 ns        16382
avx512_partial_qsort<float>/100/100000          42920 ns        42924 ns        15972
avx512_partial_qsort<float>/1000/100000         46047 ns        46049 ns        15158
avx512_partial_qsort<float>/5000/100000         60756 ns        60764 ns        11558
avx512_partial_qsort<float>/5/250000           155479 ns       155443 ns         4541
avx512_partial_qsort<float>/10/250000          155800 ns       155762 ns         4637
avx512_partial_qsort<float>/100/250000         152167 ns       152117 ns         4443
avx512_partial_qsort<float>/1000/250000        157152 ns       157101 ns         4385
avx512_partial_qsort<float>/5000/250000        178012 ns       177967 ns         4027
stdpartialsort<float>/5/10000                    6039 ns         6021 ns       116152
stdpartialsort<float>/10/10000                   6510 ns         6493 ns       107854
stdpartialsort<float>/100/10000                 11680 ns        11662 ns        60386
stdpartialsort<float>/1000/10000               236097 ns       236104 ns         3005
stdpartialsort<float>/5000/10000               754352 ns       754385 ns          923
stdpartialsort<float>/5/100000                  52376 ns        52374 ns        13364
stdpartialsort<float>/10/100000                 53522 ns        53522 ns        13087
stdpartialsort<float>/100/100000                66674 ns        66668 ns        10519
stdpartialsort<float>/1000/100000              504935 ns       504959 ns         1388
stdpartialsort<float>/5000/100000             1982012 ns      1982063 ns          354
stdpartialsort<float>/5/250000                 129443 ns       129404 ns         5414
stdpartialsort<float>/10/250000                130424 ns       130381 ns         5370
stdpartialsort<float>/100/250000               147670 ns       147631 ns         4739
stdpartialsort<float>/1000/250000              668130 ns       668158 ns         1047
stdpartialsort<float>/5000/250000             2521280 ns      2521342 ns          278
avx512_partial_qsort<uint32_t>/5/10000           4155 ns         4134 ns       169393
avx512_partial_qsort<uint32_t>/10/10000          4142 ns         4121 ns       168701
avx512_partial_qsort<uint32_t>/100/10000         4177 ns         4158 ns       168210
avx512_partial_qsort<uint32_t>/1000/10000        6307 ns         6287 ns       111263
avx512_partial_qsort<uint32_t>/5000/10000       18211 ns        18195 ns        38524
avx512_partial_qsort<uint32_t>/5/100000         57602 ns        57603 ns        12083
avx512_partial_qsort<uint32_t>/10/100000        57303 ns        57305 ns        12062
avx512_partial_qsort<uint32_t>/100/100000       57921 ns        57925 ns        12106
avx512_partial_qsort<uint32_t>/1000/100000      59803 ns        59808 ns        11712
avx512_partial_qsort<uint32_t>/5000/100000      71373 ns        71376 ns         9745
avx512_partial_qsort<uint32_t>/5/250000        179892 ns       179857 ns         3883
avx512_partial_qsort<uint32_t>/10/250000       180899 ns       180871 ns         3852
avx512_partial_qsort<uint32_t>/100/250000      180101 ns       180073 ns         3852
avx512_partial_qsort<uint32_t>/1000/250000     183314 ns       183266 ns         3838
avx512_partial_qsort<uint32_t>/5000/250000     195241 ns       195205 ns         3584
stdpartialsort<uint32_t>/5/10000                 4207 ns         4189 ns       167094
stdpartialsort<uint32_t>/10/10000                4774 ns         4756 ns       147131
stdpartialsort<uint32_t>/100/10000              10259 ns        10242 ns        68533
stdpartialsort<uint32_t>/1000/10000            216093 ns       216106 ns         3226
stdpartialsort<uint32_t>/5000/10000            704433 ns       704466 ns          985
stdpartialsort<uint32_t>/5/100000               34803 ns        34797 ns        20115
stdpartialsort<uint32_t>/10/100000              35973 ns        35967 ns        19457
stdpartialsort<uint32_t>/100/100000             50082 ns        50082 ns        13985
stdpartialsort<uint32_t>/1000/100000           452933 ns       452950 ns         1544
stdpartialsort<uint32_t>/5000/100000          1854542 ns      1854596 ns          378
stdpartialsort<uint32_t>/5/250000               85819 ns        85758 ns         8164
stdpartialsort<uint32_t>/10/250000              87320 ns        87272 ns         8024
stdpartialsort<uint32_t>/100/250000            106499 ns       106457 ns         6574
stdpartialsort<uint32_t>/1000/250000           578043 ns       578044 ns         1212
stdpartialsort<uint32_t>/5000/250000          2352771 ns      2352794 ns          298
avx512_partial_qsort<int32_t>/5/10000            4167 ns         4146 ns       169170
avx512_partial_qsort<int32_t>/10/10000           4161 ns         4139 ns       169019
avx512_partial_qsort<int32_t>/100/10000          4218 ns         4198 ns       167544
avx512_partial_qsort<int32_t>/1000/10000         6297 ns         6278 ns       111431
avx512_partial_qsort<int32_t>/5000/10000        18052 ns        18038 ns        38770
avx512_partial_qsort<int32_t>/5/100000          35135 ns        35136 ns        19335
avx512_partial_qsort<int32_t>/10/100000         34970 ns        34972 ns        19853
avx512_partial_qsort<int32_t>/100/100000        35675 ns        35676 ns        18695
avx512_partial_qsort<int32_t>/1000/100000       37063 ns        37066 ns        19230
avx512_partial_qsort<int32_t>/5000/100000       49076 ns        49077 ns        14085
avx512_partial_qsort<int32_t>/5/250000         158554 ns       158507 ns         4412
avx512_partial_qsort<int32_t>/10/250000        158645 ns       158615 ns         4434
avx512_partial_qsort<int32_t>/100/250000       159237 ns       159210 ns         4420
avx512_partial_qsort<int32_t>/1000/250000      163677 ns       163646 ns         4354
avx512_partial_qsort<int32_t>/5000/250000      176629 ns       176592 ns         3933
stdpartialsort<int32_t>/5/10000                  6314 ns         6296 ns       108554
stdpartialsort<int32_t>/10/10000                 7338 ns         7321 ns        99074
stdpartialsort<int32_t>/100/10000               13867 ns        13853 ns        51396
stdpartialsort<int32_t>/1000/10000             216613 ns       216625 ns         3224
stdpartialsort<int32_t>/5000/10000             706113 ns       706167 ns          990
stdpartialsort<int32_t>/5/100000                61781 ns        61777 ns        11329
stdpartialsort<int32_t>/10/100000               50996 ns        50993 ns        13875
stdpartialsort<int32_t>/100/100000              67301 ns        67302 ns        10204
stdpartialsort<int32_t>/1000/100000            465374 ns       465401 ns         1504
stdpartialsort<int32_t>/5000/100000           1846130 ns      1846145 ns          380
stdpartialsort<int32_t>/5/250000               131237 ns       131188 ns         5705
stdpartialsort<int32_t>/10/250000              132351 ns       132297 ns         5349
stdpartialsort<int32_t>/100/250000             152814 ns       152791 ns         4147
stdpartialsort<int32_t>/1000/250000            603068 ns       603056 ns         1163
stdpartialsort<int32_t>/5000/250000           2360949 ns      2361014 ns          297
avx512_partial_qsort<double>/5/10000             7108 ns         7119 ns        96952
avx512_partial_qsort<double>/10/10000            7293 ns         7302 ns        95205
avx512_partial_qsort<double>/100/10000           7357 ns         7364 ns        94774
avx512_partial_qsort<double>/1000/10000         10577 ns        10584 ns        64127
avx512_partial_qsort<double>/5000/10000         32527 ns        32535 ns        21598
avx512_partial_qsort<double>/5/100000          161023 ns       161014 ns         4170
avx512_partial_qsort<double>/10/100000         150762 ns       150750 ns         4626
avx512_partial_qsort<double>/100/100000        153012 ns       152998 ns         4647
avx512_partial_qsort<double>/1000/100000       155889 ns       155878 ns         4440
avx512_partial_qsort<double>/5000/100000       183512 ns       183506 ns         3819
avx512_partial_qsort<double>/5/250000          404052 ns       403981 ns         1700
avx512_partial_qsort<double>/10/250000         407392 ns       407303 ns         1717
avx512_partial_qsort<double>/100/250000        419087 ns       419031 ns         1697
avx512_partial_qsort<double>/1000/250000       403081 ns       402999 ns         1732
avx512_partial_qsort<double>/5000/250000       436609 ns       436537 ns         1617
stdpartialsort<double>/5/10000                   5860 ns         5858 ns       119445
stdpartialsort<double>/10/10000                  6229 ns         6227 ns       112593
stdpartialsort<double>/100/10000                12844 ns        12844 ns        55573
stdpartialsort<double>/1000/10000              242951 ns       242962 ns         2848
stdpartialsort<double>/5000/10000              745811 ns       745864 ns          938
stdpartialsort<double>/5/100000                 52276 ns        52247 ns        13415
stdpartialsort<double>/10/100000                53061 ns        53030 ns        13204
stdpartialsort<double>/100/100000               67371 ns        67339 ns        10438
stdpartialsort<double>/1000/100000             506049 ns       506062 ns         1381
stdpartialsort<double>/5000/100000            1990522 ns      1990575 ns          352
stdpartialsort<double>/5/250000                130250 ns       130104 ns         5381
stdpartialsort<double>/10/250000               131560 ns       131415 ns         5338
stdpartialsort<double>/100/250000              150860 ns       150714 ns         4637
stdpartialsort<double>/1000/250000             676832 ns       676780 ns         1036
stdpartialsort<double>/5000/250000            2560117 ns      2560153 ns          274
avx512_partial_qsort<uint64_t>/5/10000           8871 ns         8877 ns        78426
avx512_partial_qsort<uint64_t>/10/10000          9541 ns         9550 ns        77223
avx512_partial_qsort<uint64_t>/100/10000         9338 ns         9345 ns        75978
avx512_partial_qsort<uint64_t>/1000/10000       15390 ns        15397 ns        45589
avx512_partial_qsort<uint64_t>/5000/10000       40001 ns        40010 ns        17508
avx512_partial_qsort<uint64_t>/5/100000        127375 ns       127358 ns         4671
avx512_partial_qsort<uint64_t>/10/100000       114169 ns       114151 ns         5770
avx512_partial_qsort<uint64_t>/100/100000      117132 ns       117117 ns         6551
avx512_partial_qsort<uint64_t>/1000/100000     112633 ns       112612 ns         6048
avx512_partial_qsort<uint64_t>/5000/100000     142300 ns       142288 ns         4988
avx512_partial_qsort<uint64_t>/5/250000        341957 ns       341843 ns         2047
avx512_partial_qsort<uint64_t>/10/250000       342174 ns       342055 ns         2072
avx512_partial_qsort<uint64_t>/100/250000      352411 ns       352281 ns         2058
avx512_partial_qsort<uint64_t>/1000/250000     350234 ns       350102 ns         2002
avx512_partial_qsort<uint64_t>/5000/250000     376417 ns       376306 ns         1868
stdpartialsort<uint64_t>/5/10000                 4106 ns         4101 ns       170560
stdpartialsort<uint64_t>/10/10000                4836 ns         4834 ns       144835
stdpartialsort<uint64_t>/100/10000              10426 ns        10422 ns        67213
stdpartialsort<uint64_t>/1000/10000            212960 ns       212963 ns         3274
stdpartialsort<uint64_t>/5000/10000            694712 ns       694771 ns         1002
stdpartialsort<uint64_t>/5/100000               34826 ns        34792 ns        20100
stdpartialsort<uint64_t>/10/100000              36010 ns        35977 ns        19426
stdpartialsort<uint64_t>/100/100000             51063 ns        51031 ns        13690
stdpartialsort<uint64_t>/1000/100000           458494 ns       458507 ns         1526
stdpartialsort<uint64_t>/5000/100000          1864329 ns      1864416 ns          376
stdpartialsort<uint64_t>/5/250000               90465 ns        90312 ns         7479
stdpartialsort<uint64_t>/10/250000              91399 ns        91243 ns         7640
stdpartialsort<uint64_t>/100/250000            110302 ns       110149 ns         6429
stdpartialsort<uint64_t>/1000/250000           595573 ns       595511 ns         1174
stdpartialsort<uint64_t>/5000/250000          2384880 ns      2384895 ns          294
avx512_partial_qsort<int64_t>/5/10000            9109 ns         9117 ns        79176
avx512_partial_qsort<int64_t>/10/10000           9044 ns         9056 ns        68024
avx512_partial_qsort<int64_t>/100/10000          9206 ns         9214 ns        76247
avx512_partial_qsort<int64_t>/1000/10000        15499 ns        15507 ns        45103
avx512_partial_qsort<int64_t>/5000/10000        40064 ns        40072 ns        17527
avx512_partial_qsort<int64_t>/5/100000         113981 ns       113959 ns         5976
avx512_partial_qsort<int64_t>/10/100000        115372 ns       115352 ns         6132
avx512_partial_qsort<int64_t>/100/100000       116875 ns       116855 ns         6102
avx512_partial_qsort<int64_t>/1000/100000      122425 ns       122409 ns         5786
avx512_partial_qsort<int64_t>/5000/100000      157979 ns       157972 ns         4472
avx512_partial_qsort<int64_t>/5/250000         346476 ns       346355 ns         1993
avx512_partial_qsort<int64_t>/10/250000        343482 ns       343364 ns         2020
avx512_partial_qsort<int64_t>/100/250000       347941 ns       347792 ns         2036
avx512_partial_qsort<int64_t>/1000/250000      353338 ns       353217 ns         1985
avx512_partial_qsort<int64_t>/5000/250000      388811 ns       388705 ns         1800
stdpartialsort<int64_t>/5/10000                  7334 ns         7332 ns        95779
stdpartialsort<int64_t>/10/10000                 7890 ns         7883 ns        88763
stdpartialsort<int64_t>/100/10000               13179 ns        13178 ns        51343
stdpartialsort<int64_t>/1000/10000             215350 ns       215357 ns         3195
stdpartialsort<int64_t>/5000/10000             677097 ns       677150 ns         1021
stdpartialsort<int64_t>/5/100000                67803 ns        67774 ns        10327
stdpartialsort<int64_t>/10/100000               68871 ns        68845 ns        10162
stdpartialsort<int64_t>/100/100000              81961 ns        81931 ns         8532
stdpartialsort<int64_t>/1000/100000            458089 ns       458094 ns         1505
stdpartialsort<int64_t>/5000/100000           1831979 ns      1832032 ns          382
stdpartialsort<int64_t>/5/250000               187038 ns       186898 ns         3905
stdpartialsort<int64_t>/10/250000              171052 ns       170918 ns         3415
stdpartialsort<int64_t>/100/250000             187379 ns       187256 ns         3740
stdpartialsort<int64_t>/1000/250000            638464 ns       638394 ns         1098
stdpartialsort<int64_t>/5000/250000           2395204 ns      2395265 ns          292
avx512_partial_qsort<uint16_t>/5/10000           3419 ns         3421 ns       204984
avx512_partial_qsort<uint16_t>/10/10000          3419 ns         3421 ns       204605
avx512_partial_qsort<uint16_t>/100/10000         3525 ns         3528 ns       198214
avx512_partial_qsort<uint16_t>/1000/10000        5868 ns         5869 ns       119498
avx512_partial_qsort<uint16_t>/5000/10000       16578 ns        16580 ns        42217
avx512_partial_qsort<uint16_t>/5/100000         24663 ns        24633 ns        28191
avx512_partial_qsort<uint16_t>/10/100000        24721 ns        24690 ns        28131
avx512_partial_qsort<uint16_t>/100/100000       24927 ns        24893 ns        28063
avx512_partial_qsort<uint16_t>/1000/100000      26887 ns        26853 ns        26080
avx512_partial_qsort<uint16_t>/5000/100000      37737 ns        37711 ns        18622
avx512_partial_qsort<uint16_t>/5/250000         59237 ns        59232 ns        11877
avx512_partial_qsort<uint16_t>/10/250000        58962 ns        58957 ns        11719
avx512_partial_qsort<uint16_t>/100/250000       59452 ns        59448 ns        11912
avx512_partial_qsort<uint16_t>/1000/250000      60974 ns        60969 ns        11293
avx512_partial_qsort<uint16_t>/5000/250000      72921 ns        72918 ns         9719
stdpartialsort<uint16_t>/5/10000                 6092 ns         6094 ns       114001
stdpartialsort<uint16_t>/10/10000                6946 ns         6944 ns        98413
stdpartialsort<uint16_t>/100/10000              13010 ns        13011 ns        53324
stdpartialsort<uint16_t>/1000/10000            216757 ns       216774 ns         3212
stdpartialsort<uint16_t>/5000/10000            689553 ns       689590 ns         1036
stdpartialsort<uint16_t>/5/100000               47430 ns        47393 ns        14895
stdpartialsort<uint16_t>/10/100000              60475 ns        60439 ns        11368
stdpartialsort<uint16_t>/100/100000             66681 ns        66645 ns        10148
stdpartialsort<uint16_t>/1000/100000           458732 ns       458716 ns         1522
stdpartialsort<uint16_t>/5000/100000          1840599 ns      1840606 ns          380
stdpartialsort<uint16_t>/5/250000              117097 ns       117083 ns         6627
stdpartialsort<uint16_t>/10/250000             137395 ns       137384 ns         5100
stdpartialsort<uint16_t>/100/250000            139730 ns       139721 ns         4999
stdpartialsort<uint16_t>/1000/250000           607632 ns       607643 ns         1161
stdpartialsort<uint16_t>/5000/250000          2395258 ns      2395305 ns          292
avx512_partial_qsort<int16_t>/5/10000            3368 ns         3369 ns       207751
avx512_partial_qsort<int16_t>/10/10000           3369 ns         3369 ns       207818
avx512_partial_qsort<int16_t>/100/10000          3475 ns         3475 ns       201278
avx512_partial_qsort<int16_t>/1000/10000         5780 ns         5781 ns       120738
avx512_partial_qsort<int16_t>/5000/10000        16323 ns        16325 ns        42870
avx512_partial_qsort<int16_t>/5/100000          23839 ns        23806 ns        29367
avx512_partial_qsort<int16_t>/10/100000         23843 ns        23807 ns        29392
avx512_partial_qsort<int16_t>/100/100000        23935 ns        23900 ns        29345
avx512_partial_qsort<int16_t>/1000/100000       25920 ns        25885 ns        27083
avx512_partial_qsort<int16_t>/5000/100000       36570 ns        36540 ns        19113
avx512_partial_qsort<int16_t>/5/250000          57219 ns        57212 ns        12246
avx512_partial_qsort<int16_t>/10/250000         57087 ns        57081 ns        12238
avx512_partial_qsort<int16_t>/100/250000        57323 ns        57318 ns        12221
avx512_partial_qsort<int16_t>/1000/250000       59212 ns        59207 ns        11836
avx512_partial_qsort<int16_t>/5000/250000       70376 ns        70373 ns         9959
stdpartialsort<int16_t>/5/10000                  4070 ns         4071 ns       171740
stdpartialsort<int16_t>/10/10000                 4438 ns         4439 ns       158386
stdpartialsort<int16_t>/100/10000               10568 ns        10569 ns        67059
stdpartialsort<int16_t>/1000/10000             213553 ns       213567 ns         3244
stdpartialsort<int16_t>/5000/10000             692820 ns       692860 ns          999
stdpartialsort<int16_t>/5/100000                34814 ns        34777 ns        19978
stdpartialsort<int16_t>/10/100000               35451 ns        35416 ns        19751
stdpartialsort<int16_t>/100/100000              49213 ns        49172 ns        14231
stdpartialsort<int16_t>/1000/100000            462044 ns       462027 ns         1516
stdpartialsort<int16_t>/5000/100000           1833974 ns      1833984 ns          381
stdpartialsort<int16_t>/5/250000                85363 ns        85354 ns         8229
stdpartialsort<int16_t>/10/250000               86593 ns        86578 ns         8134
stdpartialsort<int16_t>/100/250000             104286 ns       104275 ns         6707
stdpartialsort<int16_t>/1000/250000            598232 ns       598234 ns         1172
stdpartialsort<int16_t>/5000/250000           2375414 ns      2375491 ns          297
avx512_qsort<_Float16>/10000                    38293 ns        38300 ns        18847
avx512_qsort<_Float16>/1000000                8611913 ns      8611946 ns           79
stdsort<_Float16>/10000                        550979 ns       551009 ns         1200
stdsort<_Float16>/1000000                    63714125 ns     63713371 ns           11
avx512_qselect<_Float16>/5/10000                 3567 ns         3569 ns       160519
avx512_qselect<_Float16>/10/10000                4037 ns         4038 ns       195470
avx512_qselect<_Float16>/100/10000               4384 ns         4385 ns       188922
avx512_qselect<_Float16>/1000/10000              3396 ns         3396 ns       183083
avx512_qselect<_Float16>/5000/10000              4011 ns         4013 ns       170157
avx512_qselect<_Float16>/5/100000               32078 ns        32058 ns        23082
avx512_qselect<_Float16>/10/100000              26188 ns        26160 ns        21929
avx512_qselect<_Float16>/100/100000             33203 ns        33172 ns        20540
avx512_qselect<_Float16>/1000/100000            27390 ns        27362 ns        22644
avx512_qselect<_Float16>/5000/100000            31220 ns        31190 ns        24947
avx512_qselect<_Float16>/5/250000              108454 ns       108455 ns         8084
avx512_qselect<_Float16>/10/250000              97921 ns        97923 ns         8746
avx512_qselect<_Float16>/100/250000             71994 ns        71992 ns         9292
avx512_qselect<_Float16>/1000/250000            84457 ns        84456 ns         8627
avx512_qselect<_Float16>/5000/250000           103203 ns       103202 ns         7181
stdnthelement<_Float16>/5/10000                 31164 ns        31167 ns        72166
stdnthelement<_Float16>/10/10000                28689 ns        28691 ns        19294
stdnthelement<_Float16>/100/10000               47057 ns        47061 ns        79363
stdnthelement<_Float16>/1000/10000              76531 ns        76536 ns        10000
stdnthelement<_Float16>/5000/10000              70087 ns        70093 ns        17499
stdnthelement<_Float16>/5/100000               965785 ns       965831 ns          783
stdnthelement<_Float16>/10/100000              635602 ns       635604 ns         1090
stdnthelement<_Float16>/100/100000             894011 ns       894044 ns         1158
stdnthelement<_Float16>/1000/100000            825690 ns       825701 ns         1107
stdnthelement<_Float16>/5000/100000           1007711 ns      1007743 ns          820
stdnthelement<_Float16>/5/250000              1972959 ns      1973005 ns          608
stdnthelement<_Float16>/10/250000             2091273 ns      2091290 ns          601
stdnthelement<_Float16>/100/250000            1358563 ns      1358578 ns          565
stdnthelement<_Float16>/1000/250000           1283385 ns      1283421 ns          830
stdnthelement<_Float16>/5000/250000           2533562 ns      2533605 ns          945
avx512_partial_qsort<_Float16>/5/10000           4295 ns         4295 ns       181499
avx512_partial_qsort<_Float16>/10/10000          3664 ns         3665 ns       180190
avx512_partial_qsort<_Float16>/100/10000         3993 ns         3995 ns       151337
avx512_partial_qsort<_Float16>/1000/10000        6685 ns         6686 ns        85237
avx512_partial_qsort<_Float16>/5000/10000       22011 ns        22015 ns        31040
avx512_partial_qsort<_Float16>/5/100000         34392 ns        34362 ns        20577
avx512_partial_qsort<_Float16>/10/100000        32231 ns        32201 ns        20123
avx512_partial_qsort<_Float16>/100/100000       29634 ns        29605 ns        23604
avx512_partial_qsort<_Float16>/1000/100000      35431 ns        35399 ns        22220
avx512_partial_qsort<_Float16>/5000/100000      50278 ns        50251 ns        10000
avx512_partial_qsort<_Float16>/5/250000         98955 ns        98953 ns         9409
avx512_partial_qsort<_Float16>/10/250000        78152 ns        78150 ns         8437
avx512_partial_qsort<_Float16>/100/250000       76757 ns        76749 ns         9102
avx512_partial_qsort<_Float16>/1000/250000      78192 ns        78190 ns         9514
avx512_partial_qsort<_Float16>/5000/250000      98030 ns        98029 ns         7014
stdpartialsort<_Float16>/5/10000                 6037 ns         6038 ns       117704
stdpartialsort<_Float16>/10/10000                6363 ns         6362 ns       101908
stdpartialsort<_Float16>/100/10000              11768 ns        11771 ns        59117
stdpartialsort<_Float16>/1000/10000            254631 ns       254641 ns         2633
stdpartialsort<_Float16>/5000/10000            749936 ns       749992 ns          916
stdpartialsort<_Float16>/5/100000               52291 ns        52252 ns        13298
stdpartialsort<_Float16>/10/100000              54011 ns        53977 ns        13136
stdpartialsort<_Float16>/100/100000             70263 ns        70235 ns        10177
stdpartialsort<_Float16>/1000/100000           512666 ns       512645 ns         1376
stdpartialsort<_Float16>/5000/100000          2006954 ns      2007022 ns          348
stdpartialsort<_Float16>/5/250000              129003 ns       128992 ns         5420
stdpartialsort<_Float16>/10/250000             130369 ns       130365 ns         5374
stdpartialsort<_Float16>/100/250000            153014 ns       153004 ns         4568
stdpartialsort<_Float16>/1000/250000           683530 ns       683555 ns         1027
stdpartialsort<_Float16>/5000/250000          2576531 ns      2576537 ns          272

@r-devulap
Member

r-devulap commented Apr 19, 2023

Looking at the integer data for k values of 5 and 10, here is a high-level overview of the benchmarks (float and double are slightly worse).

avx512_qselect is a clear win across the whole range of values and dtypes.

std::nth_element vs. avx512_qselect:

dtype size    k     arrsize    approx. AVX-512 speed-up
-------------------------------------------------------
16 bit        5     100000     7x
16 bit        5     250000     42x
32 bit        5     100000     14x
32 bit        5     250000     11x
64 bit        5     100000     8x
64 bit        5     250000     6x
16 bit        10    100000     21x
16 bit        10    250000     42x
32 bit        10    100000     14x
32 bit        10    250000     13x
64 bit        10    100000     7x
64 bit        10    250000     6.4x
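
For readers comparing the two routines: the PR description states that avx512_qselect is analogous to std::nth_element with nth = arr.begin() + k. The sketch below uses only the standard library to show the postcondition the two are expected to share. `place_kth` and `is_partitioned_around` are hypothetical helper names chosen for illustration, and the avx512_qselect call itself is left out because its exact signature is not shown in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: puts the k-th smallest value at index k and
// partitions the rest around it, using the standard-library baseline.
inline int place_kth(std::vector<int>& arr, std::size_t k)
{
    assert(k < arr.size());
    std::nth_element(arr.begin(),
                     arr.begin() + static_cast<std::ptrdiff_t>(k),
                     arr.end());
    return arr[k];
}

// Checks the nth_element postcondition: nothing left of index k is
// greater than arr[k], and nothing right of it is smaller.
inline bool is_partitioned_around(const std::vector<int>& arr, std::size_t k)
{
    for (std::size_t i = 0; i < k; ++i)
        if (arr[i] > arr[k]) return false;
    for (std::size_t i = k + 1; i < arr.size(); ++i)
        if (arr[i] < arr[k]) return false;
    return true;
}
```

Neither side of index k is sorted after the call; only the element at k is in its final position, which is why a selection routine can do less work than a partial sort.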

avx512_partial_qsort is great for k > 100 but performs poorly for smaller values of k.

std::partial_sort vs. avx512_partial_qsort:

dtype size    k     arrsize    approx. AVX-512 speed-up
-------------------------------------------------------
16 bit        5     100000     2x
16 bit        5     250000     2x
32 bit        5     100000     0.6x
32 bit        5     250000     0.5x
64 bit        5     100000     0.27x
64 bit        5     250000     0.26x
16 bit        10    100000     2.4x
16 bit        10    250000     2.3x
32 bit        10    100000     0.63x
32 bit        10    250000     0.48x
64 bit        10    100000     0.31x
64 bit        10    250000     0.26x
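
To make the speed-up column concrete, here is a small sketch of the arithmetic, assuming the figures are the std time divided by the AVX-512 time. The unsigned-integer rows of the benchmark log above reproduce the quoted k = 5, arrsize = 100000 values under that assumption; `speedup` is a hypothetical helper name.

```cpp
#include <cassert>

// Speed-up as assumed for the tables: baseline (std) time divided by
// AVX-512 time. Values above 1 mean the AVX-512 routine is faster;
// values below 1 mean it is slower.
inline double speedup(double std_ns, double avx512_ns)
{
    assert(avx512_ns > 0.0);
    return std_ns / avx512_ns;
}

// Times in ns taken from the k = 5, arrsize = 100000 rows of the log:
//   uint16_t: stdpartialsort 47430, avx512_partial_qsort  24663
//   uint32_t: stdpartialsort 34803, avx512_partial_qsort  57602
//   uint64_t: stdpartialsort 34826, avx512_partial_qsort 127375
```

With these inputs the ratios come out close to 1.9, 0.60 and 0.27, matching the 2x, 0.6x and 0.27x entries.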

Member

@r-devulap r-devulap left a comment

LGTM. Thank you very much for your contribution. PR #33 adds more optimizations to the vectorized partitioning function. I will rebase this patch onto that and see if it helps improve partial sort.

@r-devulap r-devulap merged commit 6709593 into numpy:main Apr 19, 2023
@mosullivan93
Contributor Author

Thanks for your help in improving this PR. I'll keep an eye on this project to see if some of the enhancements you're making will help shift the balance for the AVX partial methods. std::partial_sort's implementation is impressive; I might look more closely at it some day.

@r-devulap
Member

r-devulap commented Apr 25, 2023

@mosullivan93 See #33 (comment). Unrolling the partition algorithm narrows the gap between avx512_partial_qsort and std::partial_sort for small values of k:

Benchmark                                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
-------------------------------------------------------------------------------------------------------------------------------------------------------
[stdpartialsort vs. avx512_partial_qsort]<float>/10                     -0.1264         -0.1250          6918          6044          6917          6053
[stdpartialsort vs. avx512_partial_qsort]<float>/100                    -0.7553         -0.7550         25413          6218         25412          6227
[stdpartialsort vs. avx512_partial_qsort]<float>/1000                   -0.9596         -0.9596        239209          9654        239202          9664
[stdpartialsort vs. avx512_partial_qsort]<float>/5000                   -0.9647         -0.9647        698310         24627        698288         24636
[stdpartialsort vs. avx512_partial_qsort]<uint32_t>/10                  -0.2492         -0.2477          6082          4567          6083          4576
[stdpartialsort vs. avx512_partial_qsort]<uint32_t>/100                 -0.6927         -0.6920         15331          4711         15330          4721
[stdpartialsort vs. avx512_partial_qsort]<uint32_t>/1000                -0.9682         -0.9682        231959          7370        231951          7379
[stdpartialsort vs. avx512_partial_qsort]<uint32_t>/5000                -0.9684         -0.9684        669613         21136        669579         21145
[stdpartialsort vs. avx512_partial_qsort]<int32_t>/10                   -0.4487         -0.4479          8308          4580          8312          4589
[stdpartialsort vs. avx512_partial_qsort]<int32_t>/100                  -0.7148         -0.7144         16601          4735         16610          4744
[stdpartialsort vs. avx512_partial_qsort]<int32_t>/1000                 -0.9676         -0.9676        227938          7377        227930          7386
[stdpartialsort vs. avx512_partial_qsort]<int32_t>/5000                 -0.9680         -0.9680        661007         21157        660980         21166
[stdpartialsort vs. avx512_partial_qsort]<double>/10                    +0.3622         +0.3643          6618          9015          6615          9024
[stdpartialsort vs. avx512_partial_qsort]<double>/100                   -0.4889         -0.4885         18107          9255         18114          9264
[stdpartialsort vs. avx512_partial_qsort]<double>/1000                  -0.9474         -0.9474        254806         13399        254802         13412
[stdpartialsort vs. avx512_partial_qsort]<double>/5000                  -0.9477         -0.9477        718405         37558        718395         37566
[stdpartialsort vs. avx512_partial_qsort]<uint64_t>/10                  +0.1073         +0.1091          6216          6883          6214          6892
[stdpartialsort vs. avx512_partial_qsort]<uint64_t>/100                 -0.5243         -0.5237         15109          7188         15108          7195
[stdpartialsort vs. avx512_partial_qsort]<uint64_t>/1000                -0.9327         -0.9326        229822         15477        229818         15488
[stdpartialsort vs. avx512_partial_qsort]<uint64_t>/5000                -0.9372         -0.9372        648900         40727        648902         40735
[stdpartialsort vs. avx512_partial_qsort]<int64_t>/10                   -0.5261         -0.5253         14470          6858         14466          6866
[stdpartialsort vs. avx512_partial_qsort]<int64_t>/100                  -0.7017         -0.7013         24157          7206         24155          7216
[stdpartialsort vs. avx512_partial_qsort]<int64_t>/1000                 -0.9365         -0.9365        243154         15442        243144         15452
[stdpartialsort vs. avx512_partial_qsort]<int64_t>/5000                 -0.9398         -0.9398        676266         40703        676242         40713
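
A note on reading the table above: the Time and CPU columns are consistent with a relative change of (new - old) / old, where "old" is stdpartialsort and "new" is avx512_partial_qsort, so negative values mean the AVX-512 routine is faster. The sketch below reproduces that arithmetic for two rows; the function names are hypothetical, and small differences in the last digit are expected because the printed times are rounded.

```cpp
#include <cassert>

// Relative change as it appears in the Time/CPU columns:
// (new - old) / |old|. Negative means "new" took less time.
inline double relative_change(double old_ns, double new_ns)
{
    assert(old_ns != 0.0);
    const double magnitude = old_ns < 0.0 ? -old_ns : old_ns;
    return (new_ns - old_ns) / magnitude;
}

// The same comparison expressed as a speed-up factor (old / new).
inline double speedup_from_times(double old_ns, double new_ns)
{
    assert(new_ns > 0.0);
    return old_ns / new_ns;
}

// Rows used as examples (Time Old, Time New):
//   <double>/10  :   6618,  9015  -> about +0.362 (AVX-512 slower)
//   <float>/5000 : 698310, 24627  -> about -0.965 (AVX-512 ~28x faster)
```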

@mosullivan93
Contributor Author

Those are some impressive performance improvements. I've never heard of this unroll directive.
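
The thread does not show which directive PR #33 uses. Purely as an illustration, GCC offers `#pragma GCC unroll N`, which asks the compiler to unroll the loop that follows it. `sum_unrolled` below is a hypothetical example, not code from this library, and the unroll factor of 8 is an arbitrary choice.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example: sums a vector while hinting the compiler to
// unroll the loop 8 times. The pragma is only a hint: the result is the
// same with or without it, and a compiler that does not recognise it
// will typically ignore it (possibly with a warning).
inline long long sum_unrolled(const std::vector<int>& values)
{
    long long total = 0;
#pragma GCC unroll 8
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i];
    }
    return total;
}
```

Unrolling trades code size for fewer loop-control branches and more independent work per iteration, which is one plausible reason it would help a vectorized partition loop.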

@danlark1

danlark1 commented Jun 2, 2023

Note that nth_element is not the fastest general-purpose selection algorithm. floyd_rivest seems to be at least 2x faster; see https://github.com/danlark1/miniselect#performance-results

It's unlikely to beat the SIMD versions presented here, though. I'll benchmark once I find an AVX-512 machine.

@r-devulap
Member

@danlark1 good to know. I can add floyd_rivest to the benchmarks. Does STL support it?

@danlark1

danlark1 commented Jun 6, 2023

> @danlark1 good to know. I can add floyd_rivest to the benchmarks. Does STL support it?

The STLs of the major compilers and toolchains are unlikely to support Floyd-Rivest because it relies on floating-point arithmetic, which breaks constexpr algorithms. For benchmarking, however, it should be a drop-in replacement, since it is properly templated.
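
To show where the floating-point arithmetic mentioned above comes in, here is a minimal sketch of the textbook Floyd-Rivest selection. It is not the miniselect implementation: the function name, the index-based interface, and the constants (the 600-element threshold and the 0.5 factors) follow the commonly published formulation and are assumptions for illustration. The calls to std::log, std::exp and std::sqrt are the part that is not constexpr in C++20.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: rearranges a[left..right] (inclusive) so that
// a[k] holds the value it would have if the range were sorted.
template <typename T>
void floyd_rivest_select(std::vector<T>& a, std::ptrdiff_t left,
                         std::ptrdiff_t right, std::ptrdiff_t k)
{
    assert(left <= k && k <= right);
    while (right > left) {
        if (right - left > 600) {
            // Floating-point step: choose a narrower window around k
            // that is likely to contain the answer, and select within
            // it first so a[k] becomes a good pivot.
            const double n = static_cast<double>(right - left + 1);
            const double i = static_cast<double>(k - left + 1);
            const double z = std::log(n);
            const double s = 0.5 * std::exp(2.0 * z / 3.0);
            const double sign = (i - n / 2.0 < 0.0) ? -1.0 : 1.0;
            const double sd = 0.5 * std::sqrt(z * s * (n - s) / n) * sign;
            const double kd = static_cast<double>(k);
            const auto new_left = std::max(left,
                static_cast<std::ptrdiff_t>(std::floor(kd - i * s / n + sd)));
            const auto new_right = std::min(right,
                static_cast<std::ptrdiff_t>(std::floor(kd + (n - i) * s / n + sd)));
            floyd_rivest_select(a, new_left, new_right, k);
        }
        // Partition a[left..right] around the pivot t = a[k].
        const T t = a[k];
        std::ptrdiff_t i = left;
        std::ptrdiff_t j = right;
        std::swap(a[left], a[k]);
        if (a[right] > t) std::swap(a[right], a[left]);
        while (i < j) {
            std::swap(a[i], a[j]);
            ++i;
            --j;
            while (a[i] < t) ++i;
            while (a[j] > t) --j;
        }
        if (a[left] == t) {
            std::swap(a[left], a[j]);
        } else {
            ++j;
            std::swap(a[j], a[right]);
        }
        // Keep only the side of the partition that contains index k.
        if (j <= k) left = j + 1;
        if (k <= j) right = j - 1;
    }
}
```

A call such as floyd_rivest_select(v, 0, n - 1, k) plays the same role as std::nth_element(v.begin(), v.begin() + k, v.end()), which is why it can be slotted into the existing benchmarks for comparison.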

Development

Successfully merging this pull request may close these issues.

About partial sort/topk
4 participants