Improve qsort #33

Merged
merged 11 commits into from
Apr 26, 2023

@r-devulap commented on Apr 17, 2023

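Benchmark comparison (the Time and CPU columns show the relative change, (new - old) / old, so negative values are improvements):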

64-bit improvements:

Up to 1.64x speedup for avx512_qsort, avx512_qselect and avx512_partial_qsort using the unrolled avx512_partition:

Benchmark                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------
avx512_qsort<double>/10000                         -0.2855         -0.2855         80600         57586         80608         57595
avx512_qsort<double>/1000000                       -0.3934         -0.3934      16793069      10187304      16792501      10186808
avx512_qsort<uint64_t>/10000                       -0.2095         -0.2096         85913         67912         85927         67917
avx512_qsort<uint64_t>/1000000                     -0.3451         -0.3451      17036085      11156221      17035521      11155822
avx512_qsort<int64_t>/10000                        -0.1953         -0.1952         84153         67721         84159         67728
avx512_qsort<int64_t>/1000000                      -0.3479         -0.3479      17082368      11138920      17081635      11138410
avx512_partial_qsort<double>/10                    -0.1180         -0.1176          9909          8740          9910          8745
avx512_partial_qsort<double>/100                   -0.0999         -0.0996          9938          8946          9942          8951
avx512_partial_qsort<double>/1000                  -0.0519         -0.0517         13879         13159         13885         13167
avx512_partial_qsort<double>/5000                  -0.2305         -0.2304         46276         35610         46283         35618
avx512_partial_qsort<uint64_t>/10                  -0.3791         -0.3790         11295          7013         11300          7018
avx512_partial_qsort<uint64_t>/100                 -0.3612         -0.3613         11530          7365         11538          7370
avx512_partial_qsort<uint64_t>/1000                -0.3142         -0.3141         17270         11844         17277         11851
avx512_partial_qsort<uint64_t>/5000                -0.0923         -0.0923         44120         40048         44128         40054
avx512_partial_qsort<int64_t>/10                   -0.3567         -0.3565         10923          7027         10927          7031
avx512_partial_qsort<int64_t>/100                  -0.3503         -0.3503         11342          7369         11348          7373
avx512_partial_qsort<int64_t>/1000                 -0.3053         -0.3051         17040         11838         17046         11844
avx512_partial_qsort<int64_t>/5000                 -0.0884         -0.0884         43852         39976         43860         39982
avx512_qselect<double>/10                          -0.1167         -0.1164          9885          8731          9888          8737
avx512_qselect<double>/100                         -0.1020         -0.1017          9721          8729          9724          8735
avx512_qselect<double>/1000                        -0.0445         -0.0443          9098          8693          9103          8699
avx512_qselect<double>/5000                        -0.2248         -0.2246         11236          8710         11241          8716
avx512_qselect<uint64_t>/10                        -0.3807         -0.3806         11246          6965         11253          6969
avx512_qselect<uint64_t>/100                       -0.3769         -0.3768         11213          6987         11219          6992
avx512_qselect<uint64_t>/1000                      -0.3964         -0.3964         11364          6859         11369          6863
avx512_qselect<uint64_t>/5000                      -0.2009         -0.2010          9211          7360          9218          7365
avx512_qselect<int64_t>/10                         -0.3645         -0.3643         10959          6964         10963          6969
avx512_qselect<int64_t>/100                        -0.3626         -0.3624         10951          6980         10956          6986
avx512_qselect<int64_t>/1000                       -0.3825         -0.3823         11069          6836         11074          6841
avx512_qselect<int64_t>/5000                       -0.1803         -0.1800          8976          7358          8981          7364
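
The relative-change columns can be turned into speedup factors with a little arithmetic. The sketch below is only an illustration of that conversion; the helper names `speedup_from_relative_change` and `speedup_from_times` are hypothetical and are not part of this PR or of the benchmark tooling.

```cpp
#include <cassert>

// Hypothetical helper (not part of the PR): converts a relative change,
// defined as (new - old) / old, into a speedup factor old / new.
// Since new / old = 1 + rel_change, the speedup is 1 / (1 + rel_change).
double speedup_from_relative_change(double rel_change)
{
    assert(rel_change > -1.0); // a change of -1.0 would mean zero run time
    return 1.0 / (1.0 + rel_change);
}

// Hypothetical helper (not part of the PR): the same factor computed
// directly from the "Time Old" and "Time New" columns.
double speedup_from_times(double time_old, double time_new)
{
    assert(time_new > 0.0);
    return time_old / time_new;
}
```

For the `avx512_qsort<double>/1000000` row, `speedup_from_relative_change(-0.3934)` and `speedup_from_times(16793069.0, 10187304.0)` both come out just under 1.65, which is where the "up to 1.64x" figure comes from.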

32-bit improvements:

Up to roughly 1.2x-1.3x speedup for avx512_qsort, avx512_qselect and avx512_partial_qsort using the unrolled avx512_partition:

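Benchmark                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------
avx512_qsort<float>/10000                          -0.1018         -0.1016         46769         42010         46770         42018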
avx512_qsort<float>/1000000                        -0.1857         -0.1857       8962300       7298007       8961476       7297402
avx512_qsort<uint32_t>/10000                       -0.0619         -0.0619         34952         32788         34955         32793
avx512_qsort<uint32_t>/1000000                     -0.1859         -0.1859       7746334       6306397       7745718       6305695
avx512_qsort<int32_t>/10000                        -0.0682         -0.0683         35036         32647         35043         32651
avx512_qsort<int32_t>/1000000                      -0.1849         -0.1849       7727002       6298105       7726168       6297432
avx512_qselect<float>/10                           -0.1104         -0.1098          6869          6111          6873          6118
avx512_qselect<float>/100                          -0.0951         -0.0945          6877          6223          6880          6230
avx512_qselect<float>/1000                         -0.1056         -0.1049          7035          6292          7039          6301
avx512_qselect<float>/5000                         -0.1129         -0.1123          6971          6185          6975          6192
avx512_qselect<uint32_t>/10                        -0.2647         -0.2638          6133          4510          6136          4517
avx512_qselect<uint32_t>/100                       -0.2555         -0.2548          6110          4549          6113          4556
avx512_qselect<uint32_t>/1000                      -0.1976         -0.1967          5996          4811          5999          4819
avx512_qselect<uint32_t>/5000                      -0.1522         -0.1516          6090          5164          6094          5170
avx512_qselect<int32_t>/10                         -0.2645         -0.2638          6173          4540          6176          4547
avx512_qselect<int32_t>/100                        -0.2581         -0.2574          6120          4540          6123          4547
avx512_qselect<int32_t>/1000                       -0.1941         -0.1934          5985          4823          5988          4830
avx512_qselect<int32_t>/5000                       -0.1629         -0.1625          6159          5156          6163          5162
avx512_partial_qsort<float>/10                     -0.1235         -0.1227          6977          6116          6980          6123
avx512_partial_qsort<float>/100                    -0.1017         -0.1014          7069          6350          7076          6358
avx512_partial_qsort<float>/1000                   -0.0707         -0.0703         10515          9772         10519          9779
avx512_partial_qsort<float>/5000                   -0.0633         -0.0632         26409         24737         26411         24743
avx512_partial_qsort<uint32_t>/10                  -0.2710         -0.2703          6229          4541          6231          4547
avx512_partial_qsort<uint32_t>/100                 -0.2583         -0.2576          6291          4666          6294          4673
avx512_partial_qsort<uint32_t>/1000                -0.1492         -0.1486          8610          7325          8613          7333
avx512_partial_qsort<uint32_t>/5000                -0.0702         -0.0701         22641         21052         22645         21058
avx512_partial_qsort<int32_t>/10                   -0.2590         -0.2583          6182          4581          6185          4588
avx512_partial_qsort<int32_t>/100                  -0.2527         -0.2520          6226          4653          6229          4659
avx512_partial_qsort<int32_t>/1000                 -0.1492         -0.1485          8629          7342          8632          7349
avx512_partial_qsort<int32_t>/5000                 -0.0734         -0.0732         22636         20974         22640         20982
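
For readers less familiar with the three operations benchmarked here, the sketch below shows standard-library analogues of a full sort, a selection, and a partial sort. It assumes, based on the function names, that avx512_qselect and avx512_partial_qsort behave like `std::nth_element` and `std::partial_sort`; it does not use the AVX-512 code from this PR, and the wrapper names are hypothetical. The meaning of the numeric benchmark argument (array size or k) is not stated in this PR, so the sketch makes no claim about it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: full sort, the scalar analogue of a quicksort.
template <typename T>
void sort_all(std::vector<T>& v)
{
    std::sort(v.begin(), v.end());
}

// Hypothetical wrapper: selection. Afterwards v[k] holds the value it
// would have in a fully sorted array; elements before it are not greater
// and elements after it are not smaller.
template <typename T>
void select_kth(std::vector<T>& v, std::size_t k)
{
    assert(k < v.size());
    std::nth_element(v.begin(), v.begin() + k, v.end());
}

// Hypothetical wrapper: partial sort. Afterwards the first k elements are
// the k smallest values in ascending order; the rest are in no set order.
template <typename T>
void partial_sort_first_k(std::vector<T>& v, std::size_t k)
{
    assert(k <= v.size());
    std::partial_sort(v.begin(), v.begin() + k, v.end());
}
```

Selection and partial sorting do less work than a full sort, so all three depend heavily on the cost of the partition step.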

@r-devulap changed the title from "Add bitonic sorting network of size 256 for 64-bit dtype" to "Improve qsort" on Apr 18, 2023