Use memset in deep_copy #3944

Merged (19 commits) on Apr 28, 2021


masterleinad
Contributor

Fixes #3930. As noted in #3930 (comment), using memset seems to be advantageous at least for Cuda, but we should benchmark it more thoroughly and possibly extend it to other backends.
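
To illustrate the idea behind this change, here is a minimal host-only sketch. It is not the Kokkos implementation: the names `is_all_zero_bytes` and `fill_or_memset` are hypothetical, and `std::memset` stands in for whatever memset call a device backend would use. The assumption is that a typed fill can be replaced by one memset whenever the fill value's object representation consists only of zero bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper: true if every byte of `value` is zero.
template <class T>
bool is_all_zero_bytes(const T& value) {
  static_assert(std::is_trivially_copyable<T>::value,
                "byte inspection requires a trivially copyable type");
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  return std::all_of(bytes, bytes + sizeof(T),
                     [](unsigned char b) { return b == 0; });
}

// Hypothetical fill routine: uses memset when the value is bitwise zero,
// otherwise falls back to an element-wise fill.
// Returns true if the memset fast path was taken.
template <class T>
bool fill_or_memset(T* data, std::size_t count, const T& value) {
  assert(data != nullptr || count == 0);
  if (is_all_zero_bytes(value)) {
    if (count > 0) std::memset(data, 0, count * sizeof(T));
    return true;
  }
  std::fill(data, data + count, value);
  return false;
}
```

Note that the check is on bytes, not on `value == T{}`: for example, `-0.0` compares equal to `0.0` but has a non-zero sign bit, so it takes the element-wise path in this sketch.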

@masterleinad
Contributor Author

Some more results on a V100:

BM_generic<TestZeroing, char>/1024/real_time                                10945 ns        10941 ns        66940 Bandwidth=93.5618M/s
BM_generic<TestZeroing, char>/4096/real_time                                12519 ns        12495 ns        55205 Bandwidth=327.189M/s
BM_generic<TestZeroing, char>/32768/real_time                               15353 ns        15347 ns        45992 Bandwidth=2.13431G/s
BM_generic<TestZeroing, char>/262144/real_time                              15189 ns        15185 ns        44561 Bandwidth=17.2589G/s
BM_generic<TestZeroing, char>/2097152/real_time                             16357 ns        16356 ns        42601 Bandwidth=128.207G/s
BM_generic<TestZeroing, char>/16777216/real_time                            34268 ns        34264 ns        20098 Bandwidth=489.585G/s
BM_generic<TestZeroing, char>/134217728/real_time                          166694 ns       166679 ns         4176 Bandwidth=805.174G/s
BM_generic<TestZeroing, int>/1024/real_time                                 11333 ns        11333 ns        62242 Bandwidth=361.429M/s
BM_generic<TestZeroing, int>/4096/real_time                                 14342 ns        14340 ns        48861 Bandwidth=1.14241G/s
BM_generic<TestZeroing, int>/32768/real_time                                14371 ns        14370 ns        48697 Bandwidth=9.12081G/s
BM_generic<TestZeroing, int>/262144/real_time                               15385 ns        15383 ns        45913 Bandwidth=68.1568G/s
BM_generic<TestZeroing, int>/2097152/real_time                              24624 ns        24621 ns        27940 Bandwidth=340.673G/s
BM_generic<TestZeroing, int>/16777216/real_time                             91105 ns        91095 ns         7673 Bandwidth=736.612G/s
BM_generic<TestZeroing, int>/134217728/real_time                           620166 ns       620111 ns         1117 Bandwidth=865.689G/s
BM_generic<TestZeroing, double>/1024/real_time                              14340 ns        14340 ns        49530 Bandwidth=571.259M/s
BM_generic<TestZeroing, double>/4096/real_time                              14341 ns        14341 ns        48196 Bandwidth=2.28489G/s
BM_generic<TestZeroing, double>/32768/real_time                             14413 ns        14411 ns        49372 Bandwidth=18.1885G/s
BM_generic<TestZeroing, double>/262144/real_time                            15393 ns        15391 ns        45396 Bandwidth=136.237G/s
BM_generic<TestZeroing, double>/2097152/real_time                           33663 ns        33660 ns        20795 Bandwidth=498.389G/s
BM_generic<TestZeroing, double>/16777216/real_time                         166994 ns       166981 ns         4196 Bandwidth=803.728G/s
BM_generic<TestZeroing, double>/134217728/real_time                       1224932 ns      1224774 ns          562 Bandwidth=876.572G/s
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                14128 ns        14127 ns        50165 Bandwidth=1.15965G/s
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                14617 ns        14616 ns        47576 Bandwidth=4.48344G/s
BM_generic<TestZeroing, std::complex<double>>/32768/real_time               14987 ns        14987 ns        46648 Bandwidth=34.9833G/s
BM_generic<TestZeroing, std::complex<double>>/262144/real_time              16728 ns        16728 ns        40803 Bandwidth=250.729G/s
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time             52617 ns        52612 ns        13250 Bandwidth=637.717G/s
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time           317705 ns       317658 ns         2203 Bandwidth=844.922G/s
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time         2441073 ns      2440828 ns          283 Bandwidth=879.729G/s
BM_generic<TestZeroingAsync, char>/1024/real_time                            2647 ns         2647 ns       276542 Bandwidth=386.825M/s
BM_generic<TestZeroingAsync, char>/4096/real_time                            3783 ns         3778 ns       195623 Bandwidth=1082.82M/s
BM_generic<TestZeroingAsync, char>/32768/real_time                           5802 ns         5802 ns       122166 Bandwidth=5.64752G/s
BM_generic<TestZeroingAsync, char>/262144/real_time                          5678 ns         5678 ns       122513 Bandwidth=46.1698G/s
BM_generic<TestZeroingAsync, char>/2097152/real_time                         5658 ns         5658 ns       124720 Bandwidth=370.642G/s
BM_generic<TestZeroingAsync, char>/16777216/real_time                       20485 ns        20484 ns        36972 Bandwidth=818.999G/s
BM_generic<TestZeroingAsync, char>/134217728/real_time                     136611 ns       136605 ns        10000 Bandwidth=982.478G/s
BM_generic<TestZeroingAsync, int>/1024/real_time                             3531 ns         3531 ns       197778 Bandwidth=1.15993G/s
BM_generic<TestZeroingAsync, int>/4096/real_time                             5643 ns         5643 ns       120788 Bandwidth=2.90324G/s
BM_generic<TestZeroingAsync, int>/32768/real_time                            5695 ns         5695 ns       123691 Bandwidth=23.0169G/s
BM_generic<TestZeroingAsync, int>/262144/real_time                           5606 ns         5606 ns       125596 Bandwidth=187.05G/s
BM_generic<TestZeroingAsync, int>/2097152/real_time                         11273 ns        11273 ns        67872 Bandwidth=744.165G/s
BM_generic<TestZeroingAsync, int>/16777216/real_time                        69495 ns        69495 ns        10000 Bandwidth=965.665G/s
BM_generic<TestZeroingAsync, int>/134217728/real_time                      540093 ns       540087 ns        10000 Bandwidth=994.034G/s
BM_generic<TestZeroingAsync, double>/1024/real_time                          5907 ns         5906 ns       116793 Bandwidth=1.38674G/s
BM_generic<TestZeroingAsync, double>/4096/real_time                          5482 ns         5480 ns       125679 Bandwidth=5.97782G/s
BM_generic<TestZeroingAsync, double>/32768/real_time                         5438 ns         5437 ns       128386 Bandwidth=48.2045G/s
BM_generic<TestZeroingAsync, double>/262144/real_time                        6198 ns         6187 ns       128838 Bandwidth=338.34G/s
BM_generic<TestZeroingAsync, double>/2097152/real_time                      20543 ns        20535 ns        36872 Bandwidth=816.696G/s
BM_generic<TestZeroingAsync, double>/16777216/real_time                    136688 ns       136637 ns        10000 Bandwidth=981.925G/s
BM_generic<TestZeroingAsync, double>/134217728/real_time                  1080917 ns      1080549 ns        10000 Bandwidth=993.362G/s
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time            5449 ns         5449 ns       126052 Bandwidth=3.00683G/s
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time            5432 ns         5432 ns       129949 Bandwidth=12.0639G/s
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time           5539 ns         5538 ns       127130 Bandwidth=94.6498G/s
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time          5470 ns         5469 ns       121801 Bandwidth=766.746G/s
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time        37812 ns        37804 ns        19561 Bandwidth=887.394G/s
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time      271200 ns       271126 ns        10000 Bandwidth=989.806G/s
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time    2154152 ns      2153444 ns        10000 Bandwidth=996.905G/s
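
For reading the Bandwidth column: the reported values appear consistent with bytes written divided by the real time, i.e. element count times `sizeof(T)` over the first time column. A minimal sketch of that arithmetic follows; the function name is hypothetical and the formula is inferred from the numbers above, not taken from the benchmark source.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bandwidth in bytes per second for a run that
// touches `element_count` elements of `element_size` bytes each
// in `real_time_ns` nanoseconds.
inline double bandwidth_bytes_per_second(std::size_t element_count,
                                         std::size_t element_size,
                                         double real_time_ns) {
  assert(real_time_ns > 0.0);
  const double bytes = static_cast<double>(element_count) *
                       static_cast<double>(element_size);
  return bytes / (real_time_ns * 1e-9);
}
```

For the `char`/1024 row (10945 ns) this gives roughly 93.56e6 bytes/s, matching the reported 93.5618M/s up to rounding of the printed time.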

@masterleinad
Contributor Author

SYCL on V100, before:

BM_generic<TestZeroing, char>/1024/real_time                                38913 ns        37985 ns        18295 Bandwidth=26.3154M/s
BM_generic<TestZeroing, char>/4096/real_time                                41817 ns        40897 ns        17074 Bandwidth=97.9509M/s
BM_generic<TestZeroing, char>/32768/real_time                               65708 ns        64264 ns        11404 Bandwidth=498.689M/s
BM_generic<TestZeroing, char>/262144/real_time                             106839 ns       105115 ns         6302 Bandwidth=2.45365G/s
BM_generic<TestZeroing, char>/2097152/real_time                            108282 ns       106084 ns         6793 Bandwidth=19.3676G/s
BM_generic<TestZeroing, char>/16777216/real_time                           141182 ns       138544 ns         4843 Bandwidth=118.834G/s
BM_generic<TestZeroing, char>/134217728/real_time                          435145 ns       429779 ns         1626 Bandwidth=308.444G/s
BM_generic<TestZeroing, int>/1024/real_time                                 39430 ns        38728 ns        17403 Bandwidth=103.88M/s
BM_generic<TestZeroing, int>/4096/real_time                                 41192 ns        40476 ns        17076 Bandwidth=397.748M/s
BM_generic<TestZeroing, int>/32768/real_time                                66310 ns        64989 ns         9903 Bandwidth=1.97666G/s
BM_generic<TestZeroing, int>/262144/real_time                              110413 ns       107924 ns         6227 Bandwidth=9.49686G/s
BM_generic<TestZeroing, int>/2097152/real_time                             118560 ns       116033 ns         5705 Bandwidth=70.7538G/s
BM_generic<TestZeroing, int>/16777216/real_time                            186984 ns       183509 ns         3787 Bandwidth=358.902G/s
BM_generic<TestZeroing, int>/134217728/real_time                           864087 ns       862800 ns          777 Bandwidth=621.316G/s
BM_generic<TestZeroing, double>/1024/real_time                              40095 ns        39098 ns        17294 Bandwidth=204.314M/s
BM_generic<TestZeroing, double>/4096/real_time                              41009 ns        40180 ns        17475 Bandwidth=799.04M/s
BM_generic<TestZeroing, double>/32768/real_time                             65544 ns        63969 ns        11510 Bandwidth=3.99953G/s
BM_generic<TestZeroing, double>/262144/real_time                           107423 ns       105215 ns         6271 Bandwidth=19.5223G/s
BM_generic<TestZeroing, double>/2097152/real_time                          125006 ns       122507 ns         5349 Bandwidth=134.211G/s
BM_generic<TestZeroing, double>/16777216/real_time                         267534 ns       262499 ns         2675 Bandwidth=501.684G/s
BM_generic<TestZeroing, double>/134217728/real_time                       1552510 ns      1552512 ns          491 Bandwidth=691.617G/s
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                40940 ns        39915 ns        17737 Bandwidth=400.198M/s
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                43241 ns        42612 ns        15815 Bandwidth=1.51561G/s
BM_generic<TestZeroing, std::complex<double>>/32768/real_time               80820 ns        79524 ns         8384 Bandwidth=6.48711G/s
BM_generic<TestZeroing, std::complex<double>>/262144/real_time             147639 ns       144386 ns         4739 Bandwidth=28.4093G/s
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time            243642 ns       238876 ns         2823 Bandwidth=137.72G/s
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time          1069944 ns      1069944 ns          722 Bandwidth=250.888G/s
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time         6736210 ns      6725450 ns           93 Bandwidth=318.797G/s
BM_generic<TestZeroingAsync, char>/1024/real_time                           37985 ns        37302 ns        13530 Bandwidth=26.9578M/s
BM_generic<TestZeroingAsync, char>/4096/real_time                           39550 ns        38658 ns        17429 Bandwidth=103.565M/s
BM_generic<TestZeroingAsync, char>/32768/real_time                          64450 ns        63273 ns         9981 Bandwidth=508.428M/s
BM_generic<TestZeroingAsync, char>/262144/real_time                        105336 ns       102908 ns         6980 Bandwidth=2.48865G/s
BM_generic<TestZeroingAsync, char>/2097152/real_time                       104312 ns       102063 ns         6486 Bandwidth=20.1046G/s
BM_generic<TestZeroingAsync, char>/16777216/real_time                      141527 ns       138721 ns         4979 Bandwidth=118.544G/s
BM_generic<TestZeroingAsync, char>/134217728/real_time                     430345 ns       425946 ns         1601 Bandwidth=311.884G/s
BM_generic<TestZeroingAsync, int>/1024/real_time                            38761 ns        38033 ns        18963 Bandwidth=105.674M/s
BM_generic<TestZeroingAsync, int>/4096/real_time                            39799 ns        39233 ns        17282 Bandwidth=411.67M/s
BM_generic<TestZeroingAsync, int>/32768/real_time                           64449 ns        63481 ns        11561 Bandwidth=2.03374G/s
BM_generic<TestZeroingAsync, int>/262144/real_time                         107034 ns       105251 ns         6405 Bandwidth=9.7967G/s
BM_generic<TestZeroingAsync, int>/2097152/real_time                        114558 ns       112508 ns         5871 Bandwidth=73.2256G/s
BM_generic<TestZeroingAsync, int>/16777216/real_time                       183025 ns       178901 ns         3928 Bandwidth=366.665G/s
BM_generic<TestZeroingAsync, int>/134217728/real_time                      856791 ns       856792 ns          984 Bandwidth=626.607G/s
BM_generic<TestZeroingAsync, double>/1024/real_time                         39056 ns        38290 ns        18056 Bandwidth=209.751M/s
BM_generic<TestZeroingAsync, double>/4096/real_time                         40921 ns        40080 ns        17359 Bandwidth=800.757M/s
BM_generic<TestZeroingAsync, double>/32768/real_time                        64882 ns        63681 ns        11589 Bandwidth=4.04031G/s
BM_generic<TestZeroingAsync, double>/262144/real_time                      112082 ns       109774 ns         6344 Bandwidth=18.7108G/s
BM_generic<TestZeroingAsync, double>/2097152/real_time                     121507 ns       119523 ns         5482 Bandwidth=138.076G/s
BM_generic<TestZeroingAsync, double>/16777216/real_time                    267577 ns       263030 ns         2618 Bandwidth=501.605G/s
BM_generic<TestZeroingAsync, double>/134217728/real_time                  1526288 ns      1526268 ns          474 Bandwidth=703.499G/s
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time           38006 ns        37489 ns        18248 Bandwidth=431.086M/s
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time           42246 ns        41293 ns        16968 Bandwidth=1.55128G/s
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time          80288 ns        78316 ns         9388 Bandwidth=6.53011G/s
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time        151297 ns       147956 ns         4666 Bandwidth=27.7223G/s
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time       246672 ns       242352 ns         2880 Bandwidth=136.028G/s
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time     1076434 ns      1076410 ns          640 Bandwidth=249.375G/s
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time    6813501 ns      6813004 ns           93 Bandwidth=315.181G/s

SYCL on V100, after:

BM_generic<TestZeroing, char>/1024/real_time                                13848 ns        13846 ns        49021 Bandwidth=73.9451M/s
BM_generic<TestZeroing, char>/4096/real_time                                15106 ns        15106 ns        46342 Bandwidth=271.143M/s
BM_generic<TestZeroing, char>/32768/real_time                               19196 ns        19196 ns        35968 Bandwidth=1.70702G/s
BM_generic<TestZeroing, char>/262144/real_time                              19314 ns        19310 ns        36670 Bandwidth=13.5729G/s
BM_generic<TestZeroing, char>/2097152/real_time                             20584 ns        20584 ns        34211 Bandwidth=101.884G/s
BM_generic<TestZeroing, char>/16777216/real_time                            38293 ns        38293 ns        18294 Bandwidth=438.129G/s
BM_generic<TestZeroing, char>/134217728/real_time                          171561 ns       171561 ns         4090 Bandwidth=782.333G/s
BM_generic<TestZeroing, int>/1024/real_time                                 14727 ns        14727 ns        46795 Bandwidth=278.125M/s
BM_generic<TestZeroing, int>/4096/real_time                                 19209 ns        19205 ns        36127 Bandwidth=852.919M/s
BM_generic<TestZeroing, int>/32768/real_time                                19226 ns        19226 ns        36383 Bandwidth=6.81738G/s
BM_generic<TestZeroing, int>/262144/real_time                               19558 ns        19558 ns        35347 Bandwidth=53.6138G/s
BM_generic<TestZeroing, int>/2097152/real_time                              28903 ns        28904 ns        23891 Bandwidth=290.229G/s
BM_generic<TestZeroing, int>/16777216/real_time                             95265 ns        95267 ns         7375 Bandwidth=704.446G/s
BM_generic<TestZeroing, int>/134217728/real_time                           625759 ns       625759 ns         1101 Bandwidth=857.952G/s
BM_generic<TestZeroing, double>/1024/real_time                              18981 ns        18978 ns        36644 Bandwidth=431.601M/s
BM_generic<TestZeroing, double>/4096/real_time                              19042 ns        19042 ns        36613 Bandwidth=1.72083G/s
BM_generic<TestZeroing, double>/32768/real_time                             18880 ns        18879 ns        36952 Bandwidth=13.8846G/s
BM_generic<TestZeroing, double>/262144/real_time                            20289 ns        20289 ns        34832 Bandwidth=103.363G/s
BM_generic<TestZeroing, double>/2097152/real_time                           38045 ns        38045 ns        18433 Bandwidth=440.987G/s
BM_generic<TestZeroing, double>/16777216/real_time                         171245 ns       171218 ns         4098 Bandwidth=783.777G/s
BM_generic<TestZeroing, double>/134217728/real_time                       1227537 ns      1227466 ns          560 Bandwidth=874.712G/s
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                19093 ns        19093 ns        36691 Bandwidth=858.109M/s
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                19248 ns        19247 ns        36672 Bandwidth=3.40491G/s
BM_generic<TestZeroing, std::complex<double>>/32768/real_time               19212 ns        19184 ns        36041 Bandwidth=27.2902G/s
BM_generic<TestZeroing, std::complex<double>>/262144/real_time              22242 ns        22242 ns        32662 Bandwidth=188.574G/s
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time             57476 ns        57460 ns        12105 Bandwidth=583.795G/s
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time           323056 ns       323011 ns         2171 Bandwidth=830.925G/s
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time         2440934 ns      2440884 ns          283 Bandwidth=879.779G/s
BM_generic<TestZeroingAsync, char>/1024/real_time                            7395 ns         7395 ns        95221 Bandwidth=138.464M/s
BM_generic<TestZeroingAsync, char>/4096/real_time                            7136 ns         7135 ns        98205 Bandwidth=574.013M/s
BM_generic<TestZeroingAsync, char>/32768/real_time                          10670 ns        10670 ns        66113 Bandwidth=3.07102G/s
BM_generic<TestZeroingAsync, char>/262144/real_time                         10680 ns        10680 ns        66068 Bandwidth=24.5461G/s
BM_generic<TestZeroingAsync, char>/2097152/real_time                        10589 ns        10589 ns        67137 Bandwidth=198.058G/s
BM_generic<TestZeroingAsync, char>/16777216/real_time                       20504 ns        20502 ns        36963 Bandwidth=818.261G/s
BM_generic<TestZeroingAsync, char>/134217728/real_time                     136642 ns       136641 ns        10000 Bandwidth=982.26G/s
BM_generic<TestZeroingAsync, int>/1024/real_time                             7088 ns         7088 ns        99784 Bandwidth=577.858M/s
BM_generic<TestZeroingAsync, int>/4096/real_time                            10716 ns        10716 ns        65316 Bandwidth=1.52895G/s
BM_generic<TestZeroingAsync, int>/32768/real_time                           10520 ns        10520 ns        64999 Bandwidth=12.4592G/s
BM_generic<TestZeroingAsync, int>/262144/real_time                          10616 ns        10616 ns        66165 Bandwidth=98.7703G/s
BM_generic<TestZeroingAsync, int>/2097152/real_time                         11393 ns        11393 ns        63864 Bandwidth=736.309G/s
BM_generic<TestZeroingAsync, int>/16777216/real_time                        69515 ns        69512 ns        10000 Bandwidth=965.389G/s
BM_generic<TestZeroingAsync, int>/134217728/real_time                      540240 ns       540233 ns        10000 Bandwidth=993.764G/s
BM_generic<TestZeroingAsync, double>/1024/real_time                         10551 ns        10551 ns        68808 Bandwidth=776.418M/s
BM_generic<TestZeroingAsync, double>/4096/real_time                         10511 ns        10511 ns        66806 Bandwidth=3.11739G/s
BM_generic<TestZeroingAsync, double>/32768/real_time                        10557 ns        10557 ns        65498 Bandwidth=24.8302G/s
BM_generic<TestZeroingAsync, double>/262144/real_time                       10508 ns        10507 ns        66873 Bandwidth=199.585G/s
BM_generic<TestZeroingAsync, double>/2097152/real_time                      20543 ns        20543 ns        36958 Bandwidth=816.683G/s
BM_generic<TestZeroingAsync, double>/16777216/real_time                    136669 ns       136582 ns        10000 Bandwidth=982.067G/s
BM_generic<TestZeroingAsync, double>/134217728/real_time                  1081450 ns      1080327 ns        10000 Bandwidth=992.872G/s
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time           10437 ns        10437 ns        68004 Bandwidth=1.56979G/s
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time           10123 ns        10123 ns        66034 Bandwidth=6.47376G/s
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time           9968 ns         9968 ns        70075 Bandwidth=52.5946G/s
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time         10279 ns        10279 ns        70160 Bandwidth=408.055G/s
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time        37865 ns        37865 ns        19547 Bandwidth=886.167G/s
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time      271248 ns       271243 ns        10000 Bandwidth=989.632G/s
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time    2154484 ns      2154348 ns        10000 Bandwidth=996.751G/s

@masterleinad
Contributor Author

For HIP, I am seeing (old):

BM_generic<TestZeroing, char>/1024/real_time                               23008 ns        22948 ns        30507 Bandwidth=44.5071M/s
BM_generic<TestZeroing, char>/4096/real_time                               22935 ns        22876 ns        30544 Bandwidth=178.595M/s
BM_generic<TestZeroing, char>/32768/real_time                              25825 ns        25749 ns        27160 Bandwidth=1.26884G/s
BM_generic<TestZeroing, char>/262144/real_time                             37213 ns        37116 ns        18814 Bandwidth=7.04448G/s
BM_generic<TestZeroing, char>/2097152/real_time                            47231 ns        47106 ns        14868 Bandwidth=44.4019G/s
BM_generic<TestZeroing, char>/16777216/real_time                          163900 ns       163467 ns         4269 Bandwidth=102.362G/s
BM_generic<TestZeroing, int>/1024/real_time                                27041 ns        26970 ns        25508 Bandwidth=151.472M/s
BM_generic<TestZeroing, int>/4096/real_time                                26984 ns        26921 ns        25866 Bandwidth=607.172M/s
BM_generic<TestZeroing, int>/32768/real_time                               30041 ns        29968 ns        23191 Bandwidth=4.36313G/s
BM_generic<TestZeroing, int>/262144/real_time                              38306 ns        38214 ns        18329 Bandwidth=27.3736G/s
BM_generic<TestZeroing, int>/2097152/real_time                             48791 ns        48679 ns        14152 Bandwidth=171.931G/s
BM_generic<TestZeroing, int>/16777216/real_time                           165263 ns       164863 ns         4244 Bandwidth=406.072G/s
BM_generic<TestZeroing, double>/1024/real_time                             27392 ns        27327 ns        25734 Bandwidth=299.069M/s
BM_generic<TestZeroing, double>/4096/real_time                             27228 ns        27161 ns        25680 Bandwidth=1.20348G/s
BM_generic<TestZeroing, double>/32768/real_time                            30457 ns        30381 ns        22956 Bandwidth=8.60708G/s
BM_generic<TestZeroing, double>/262144/real_time                           39433 ns        39343 ns        17745 Bandwidth=53.1823G/s
BM_generic<TestZeroing, double>/2097152/real_time                          49422 ns        49298 ns        14058 Bandwidth=339.468G/s
BM_generic<TestZeroing, double>/16777216/real_time                        166661 ns       166203 ns         4206 Bandwidth=805.333G/s
BM_generic<TestZeroing, std::complex<double>>/1024/real_time               27403 ns        27336 ns        25608 Bandwidth=597.897M/s
BM_generic<TestZeroing, std::complex<double>>/4096/real_time               27216 ns        27149 ns        25524 Bandwidth=2.40803G/s
BM_generic<TestZeroing, std::complex<double>>/32768/real_time              31007 ns        30932 ns        22762 Bandwidth=16.9087G/s
BM_generic<TestZeroing, std::complex<double>>/262144/real_time             40592 ns        40475 ns        17096 Bandwidth=103.329G/s
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time            64177 ns        64022 ns        10956 Bandwidth=522.844G/s
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time          319286 ns       318486 ns         2188 Bandwidth=840.735G/s

and (new):

BM_generic<TestZeroing, char>/1024/real_time                               20362 ns        20315 ns        34476 Bandwidth=50.2886M/s
BM_generic<TestZeroing, char>/4096/real_time                               23031 ns        22972 ns        35049 Bandwidth=177.851M/s
BM_generic<TestZeroing, char>/32768/real_time                              26001 ns        25922 ns        26875 Bandwidth=1.26025G/s
BM_generic<TestZeroing, char>/262144/real_time                             34535 ns        34444 ns        20186 Bandwidth=7.59064G/s
BM_generic<TestZeroing, char>/2097152/real_time                            74896 ns        74693 ns         9425 Bandwidth=28.001G/s
BM_generic<TestZeroing, char>/16777216/real_time                           47305 ns        47186 ns        14725 Bandwidth=354.658G/s
BM_generic<TestZeroing, int>/1024/real_time                                24332 ns        24260 ns        28972 Bandwidth=168.337M/s
BM_generic<TestZeroing, int>/4096/real_time                                25186 ns        25121 ns        27667 Bandwidth=650.524M/s
BM_generic<TestZeroing, int>/32768/real_time                               30114 ns        30047 ns        22990 Bandwidth=4.35249G/s
BM_generic<TestZeroing, int>/262144/real_time                              35315 ns        35234 ns        19842 Bandwidth=29.6925G/s
BM_generic<TestZeroing, int>/2097152/real_time                             39693 ns        39591 ns        17628 Bandwidth=211.335G/s
BM_generic<TestZeroing, int>/16777216/real_time                            97903 ns        97676 ns         7152 Bandwidth=685.461G/s
BM_generic<TestZeroing, double>/1024/real_time                             24350 ns        24293 ns        28860 Bandwidth=336.426M/s
BM_generic<TestZeroing, double>/4096/real_time                             25700 ns        25637 ns        27257 Bandwidth=1.27504G/s
BM_generic<TestZeroing, double>/32768/real_time                            34162 ns        34074 ns        20510 Bandwidth=7.67352G/s
BM_generic<TestZeroing, double>/262144/real_time                           74865 ns        74691 ns         9346 Bandwidth=28.0125G/s
BM_generic<TestZeroing, double>/2097152/real_time                          47579 ns        47473 ns        14700 Bandwidth=352.621G/s
BM_generic<TestZeroing, double>/16777216/real_time                        165316 ns       164938 ns         4236 Bandwidth=811.886G/s
BM_generic<TestZeroing, std::complex<double>>/1024/real_time               24869 ns        24803 ns        28095 Bandwidth=658.819M/s
BM_generic<TestZeroing, std::complex<double>>/4096/real_time               26976 ns        26913 ns        25924 Bandwidth=2.42945G/s
BM_generic<TestZeroing, std::complex<double>>/32768/real_time              34613 ns        34534 ns        20268 Bandwidth=15.147G/s
BM_generic<TestZeroing, std::complex<double>>/262144/real_time             38371 ns        38267 ns        18361 Bandwidth=109.31G/s
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time            64258 ns        64092 ns        10901 Bandwidth=522.184G/s
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time          333283 ns       332492 ns         2154 Bandwidth=805.429G/s

@dalg24
Member

dalg24 commented Apr 15, 2021

Please use the benchmark comparison tool when you post results.
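
For readers unfamiliar with the comparison output: the Time and CPU columns in the CUDA comparison in the next comment appear to be the relative change between the old and new runs, (new - old) / old, so negative values mean the new run is faster. A minimal sketch of that calculation, with a hypothetical function name:

```cpp
#include <cassert>

// Hypothetical helper: relative change from an old to a new measurement.
// Negative results mean the new measurement is smaller (faster).
inline double relative_change(double old_value, double new_value) {
  assert(old_value != 0.0);
  return (new_value - old_value) / old_value;
}
```

With the `char`/1024 times from that comparison (old 16099, new 10690), this yields about -0.336, in line with the reported -0.3360.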

@masterleinad
Contributor Author

CUDA:

Comparing old_cuda.json to new_cuda.json
Benchmark                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_generic<TestZeroing, char>/1024/real_time                                          -0.3360         -0.3351         16099         10690         16075         10688
BM_generic<TestZeroing, char>/4096/real_time                                          -0.2409         -0.2410         15722         11934         15721         11933
BM_generic<TestZeroing, char>/32768/real_time                                         -0.1227         -0.1213         17019         14930         16990         14928
BM_generic<TestZeroing, char>/262144/real_time                                        -0.3424         -0.3424         23085         15181         23083         15180
BM_generic<TestZeroing, char>/2097152/real_time                                       -0.7518         -0.7518         67405         16730         67405         16727
BM_generic<TestZeroing, char>/16777216/real_time                                      -0.9202         -0.9200        429138         34265        428464         34262
BM_generic<TestZeroing, char>/134217728/real_time                                     -0.9496         -0.9496       3326210        167649       3324875        167624
BM_generic<TestZeroing, int>/1024/real_time                                           -0.2223         -0.2223         15637         12161         15635         12159
BM_generic<TestZeroing, int>/4096/real_time                                           -0.0767         -0.0753         16069         14837         16042         14833
BM_generic<TestZeroing, int>/32768/real_time                                          -0.1529         -0.1529         17466         14797         17464         14795
BM_generic<TestZeroing, int>/262144/real_time                                         -0.3437         -0.3428         24180         15868         24142         15865
BM_generic<TestZeroing, int>/2097152/real_time                                        -0.6415         -0.6415         69160         24790         69155         24790
BM_generic<TestZeroing, int>/16777216/real_time                                       -0.7872         -0.7869        431676         91875        431057         91873
BM_generic<TestZeroing, int>/134217728/real_time                                      -0.8113         -0.8113       3329559        628260       3329549        628227
BM_generic<TestZeroing, double>/1024/real_time                                        -0.0846         -0.0846         15843         14503         15843         14503
BM_generic<TestZeroing, double>/4096/real_time                                        -0.0817         -0.0815         15973         14668         15970         14668
BM_generic<TestZeroing, double>/32768/real_time                                       -0.1351         -0.1350         17000         14704         16999         14704
BM_generic<TestZeroing, double>/262144/real_time                                      -0.2731         -0.2730         22465         16331         22462         16331
BM_generic<TestZeroing, double>/2097152/real_time                                     -0.4925         -0.4924         67864         34439         67853         34439
BM_generic<TestZeroing, double>/16777216/real_time                                    -0.6095         -0.6094        430252        168033        430226        168031
BM_generic<TestZeroing, double>/134217728/real_time                                   -0.6273         -0.6272       3322010       1238053       3321360       1238054
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                          -0.0367         -0.0367         15529         14959         15529         14959
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                          -0.0798         -0.0798         15992         14716         15991         14715
BM_generic<TestZeroing, std::complex<double>>/32768/real_time                         -0.0951         -0.0949         16864         15260         16860         15260
BM_generic<TestZeroing, std::complex<double>>/262144/real_time                        -0.2360         -0.2360         23105         17653         23104         17653
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time                       -0.4847         -0.4847        104112         53650        104110         53650
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time                      -0.5544         -0.5543        723291        322322        723159        322309
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time                     -0.5471         -0.5471       5498376       2490369       5498384       2490375
BM_generic<TestZeroingAsync, char>/1024/real_time                                     -0.3654         -0.3654         13966          8863         13963          8861
BM_generic<TestZeroingAsync, char>/4096/real_time                                     -0.2688         -0.2687         13881         10150         13880         10150
BM_generic<TestZeroingAsync, char>/32768/real_time                                    -0.1331         -0.1330         14935         12947         14934         12948
BM_generic<TestZeroingAsync, char>/262144/real_time                                   -0.4216         -0.4209         22137         12804         22107         12803
BM_generic<TestZeroingAsync, char>/2097152/real_time                                  -0.7949         -0.7949         71094         14579         71088         14579
BM_generic<TestZeroingAsync, char>/16777216/real_time                                 -0.9250         -0.9250        435230         32643        435226         32643
BM_generic<TestZeroingAsync, char>/134217728/real_time                                -0.9504         -0.9505       3360259        166550       3360144        166313
BM_generic<TestZeroingAsync, int>/1024/real_time                                      -0.2663         -0.2656         13863         10172         13851         10171
BM_generic<TestZeroingAsync, int>/4096/real_time                                      -0.0435         -0.0449         13748         13149         13746         13129
BM_generic<TestZeroingAsync, int>/32768/real_time                                     -0.1346         -0.1346         14919         12911         14919         12911
BM_generic<TestZeroingAsync, int>/262144/real_time                                    -0.3473         -0.3473         20802         13577         20802         13577
BM_generic<TestZeroingAsync, int>/2097152/real_time                                   -0.6681         -0.6680         66955         22225         66945         22226
BM_generic<TestZeroingAsync, int>/16777216/real_time                                  -0.7943         -0.7943        431380         88743        431340         88740
BM_generic<TestZeroingAsync, int>/134217728/real_time                                 -0.8148         -0.8148       3353594        621080       3353594        621080
BM_generic<TestZeroingAsync, double>/1024/real_time                                   -0.1103         -0.1103         13969         12428         13969         12428
BM_generic<TestZeroingAsync, double>/4096/real_time                                   -0.0815         -0.0815         13528         12425         13528         12425
BM_generic<TestZeroingAsync, double>/32768/real_time                                  -0.1469         -0.1471         14600         12455         14600         12452
BM_generic<TestZeroingAsync, double>/262144/real_time                                 -0.3082         -0.3082         20108         13911         20107         13910
BM_generic<TestZeroingAsync, double>/2097152/real_time                                -0.5112         -0.5112         66583         32547         66582         32546
BM_generic<TestZeroingAsync, double>/16777216/real_time                               -0.6107         -0.6107        430416        167570        430411        167567
BM_generic<TestZeroingAsync, double>/134217728/real_time                              -0.6326         -0.6326       3359315       1234214       3359238       1234215
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time                     -0.1263         -0.1250         14500         12669         14477         12668
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time                     -0.0966         -0.0953         14084         12724         14064         12724
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time                    -0.1362         -0.1362         15406         13308         15406         13308
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time                   -0.2767         -0.2757         21855         15808         21825         15808
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time                  -0.4769         -0.4769        100310         52469        100303         52469
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time                 -0.5450         -0.5451        704283        320420        704266        320403
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time                -0.5539         -0.5538       5572643       2486207       5572312       2486212
BM_generic<TestOne, char>/1024/real_time                                              +0.0200         +0.0216         16462         16791         16436         16791
BM_generic<TestOne, char>/4096/real_time                                              -0.0107         -0.0107         16665         16487         16665         16487
BM_generic<TestOne, char>/32768/real_time                                             -0.0308         -0.0295         17544         17003         17518         17002
BM_generic<TestOne, char>/262144/real_time                                            -0.0667         -0.0667         24471         22838         24471         22838
BM_generic<TestOne, char>/2097152/real_time                                           -0.0168         -0.0168         69760         68590         69761         68590
BM_generic<TestOne, char>/16777216/real_time                                          -0.0135         -0.0135        439503        433583        439503        433584
BM_generic<TestOne, char>/134217728/real_time                                         -0.0177         -0.0177       3415167       3354625       3414940       3354338
BM_generic<TestOne, int>/1024/real_time                                               -0.0157         -0.0157         16586         16325         16586         16325
BM_generic<TestOne, int>/4096/real_time                                               -0.0250         -0.0250         16933         16510         16933         16510
BM_generic<TestOne, int>/32768/real_time                                              -0.0468         -0.0468         17639         16814         17639         16814
BM_generic<TestOne, int>/262144/real_time                                             +0.0501         +0.0501         23268         24434         23267         24433
BM_generic<TestOne, int>/2097152/real_time                                            +0.0051         +0.0053         69048         69402         69039         69402
BM_generic<TestOne, int>/16777216/real_time                                           +0.0031         +0.0032        433676        435016        433642        435016
BM_generic<TestOne, int>/134217728/real_time                                          -0.0032         -0.0032       3372591       3361688       3372592       3361696
BM_generic<TestOne, double>/1024/real_time                                            +0.0396         +0.0395         16104         16742         16104         16741
BM_generic<TestOne, double>/4096/real_time                                            +0.0477         +0.0477         16289         17066         16289         17066
BM_generic<TestOne, double>/32768/real_time                                           +0.0160         +0.0160         17121         17395         17121         17395
BM_generic<TestOne, double>/262144/real_time                                          +0.0240         +0.0241         22805         23352         22803         23352
BM_generic<TestOne, double>/2097152/real_time                                         +0.0113         +0.0113         68459         69231         68459         69231
BM_generic<TestOne, double>/16777216/real_time                                        +0.0057         +0.0058        431631        434109        431609        434103
BM_generic<TestOne, double>/134217728/real_time                                       +0.0045         +0.0045       3360179       3375306       3360131       3375253
BM_generic<TestOne, std::complex<double>>/1024/real_time                              +0.0149         +0.0149         16898         17149         16898         17149
BM_generic<TestOne, std::complex<double>>/4096/real_time                              +0.0128         +0.0129         16501         16712         16499         16712
BM_generic<TestOne, std::complex<double>>/32768/real_time                             -0.0162         -0.0162         17985         17693         17983         17692
BM_generic<TestOne, std::complex<double>>/262144/real_time                            +0.0364         +0.0364         23186         24030         23186         24030
BM_generic<TestOne, std::complex<double>>/2097152/real_time                           +0.0173         +0.0173        100102        101831        100103        101831
BM_generic<TestOne, std::complex<double>>/16777216/real_time                          +0.0080         +0.0080        701019        706644        701021        706646
BM_generic<TestOne, std::complex<double>>/134217728/real_time                         +0.0066         +0.0066       5529357       5565962       5529509       5565895
BM_generic<TestOneAsync, char>/1024/real_time                                         +0.0592         +0.0593         13487         14286         13486         14286
BM_generic<TestOneAsync, char>/4096/real_time                                         +0.0505         +0.0505         13500         14182         13500         14182
BM_generic<TestOneAsync, char>/32768/real_time                                        +0.0266         +0.0266         14804         15197         14804         15197
BM_generic<TestOneAsync, char>/262144/real_time                                       +0.0513         +0.0513         20298         21339         20298         21339
BM_generic<TestOneAsync, char>/2097152/real_time                                      +0.0034         +0.0034         66466         66690         66460         66686
BM_generic<TestOneAsync, char>/16777216/real_time                                     +0.0055         +0.0055        428782        431147        428783        431148
BM_generic<TestOneAsync, char>/134217728/real_time                                    +0.0080         +0.0080       3318622       3345184       3318616       3345181
BM_generic<TestOneAsync, int>/1024/real_time                                          +0.0883         +0.0883         13316         14492         13316         14492
BM_generic<TestOneAsync, int>/4096/real_time                                          +0.1005         +0.1005         13161         14484         13160         14483
BM_generic<TestOneAsync, int>/32768/real_time                                         +0.1130         +0.1130         13911         15483         13911         15483
BM_generic<TestOneAsync, int>/262144/real_time                                        +0.0814         +0.0815         19695         21299         19695         21299
BM_generic<TestOneAsync, int>/2097152/real_time                                       +0.0128         +0.0128         66410         67259         66410         67259
BM_generic<TestOneAsync, int>/16777216/real_time                                      +0.0085         +0.0085        428534        432187        428535        432160
BM_generic<TestOneAsync, int>/134217728/real_time                                     +0.0141         +0.0142       3304596       3351356       3304600       3351368
BM_generic<TestOneAsync, double>/1024/real_time                                       +0.0956         +0.0956         13346         14622         13346         14622
BM_generic<TestOneAsync, double>/4096/real_time                                       +0.0796         +0.0796         13297         14355         13297         14355
BM_generic<TestOneAsync, double>/32768/real_time                                      +0.1165         +0.1165         14429         16109         14429         16109
BM_generic<TestOneAsync, double>/262144/real_time                                     +0.0516         +0.0517         20153         21193         20152         21193
BM_generic<TestOneAsync, double>/2097152/real_time                                    +0.0069         +0.0069         66583         67043         66583         67043
BM_generic<TestOneAsync, double>/16777216/real_time                                   +0.0071         +0.0071        427908        430957        427903        430957
BM_generic<TestOneAsync, double>/134217728/real_time                                  +0.0104         +0.0104       3306783       3341103       3306713       3341107
BM_generic<TestOneAsync, std::complex<double>>/1024/real_time                         +0.0646         +0.0646         13593         14471         13593         14470
BM_generic<TestOneAsync, std::complex<double>>/4096/real_time                         +0.0518         +0.0518         13672         14380         13672         14380
BM_generic<TestOneAsync, std::complex<double>>/32768/real_time                        +0.0433         +0.0433         14861         15505         14861         15505
BM_generic<TestOneAsync, std::complex<double>>/262144/real_time                       +0.0224         +0.0222         21119         21592         21119         21587
BM_generic<TestOneAsync, std::complex<double>>/2097152/real_time                      +0.0086         +0.0085         98818         99667         98818         99661
BM_generic<TestOneAsync, std::complex<double>>/16777216/real_time                     +0.0101         +0.0102        696967        704030        696937        704031
BM_generic<TestOneAsync, std::complex<double>>/134217728/real_time                    +0.0087         +0.0087       5499750       5547518       5499754       5547515

@masterleinad
Contributor Author

SYCL:

Comparing old_sycl.json to new_sycl.json
Benchmark                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_generic<TestZeroing, char>/1024/real_time                                          -0.6612         -0.6524         43608         14772         42487         14768
BM_generic<TestZeroing, char>/4096/real_time                                          -0.6382         -0.6308         42919         15529         42043         15524
BM_generic<TestZeroing, char>/32768/real_time                                         -0.7075         -0.7024         69567         20346         68243         20308
BM_generic<TestZeroing, char>/262144/real_time                                        -0.8042         -0.7992        114024         22324        111028         22299
BM_generic<TestZeroing, char>/2097152/real_time                                       -0.7939         -0.7902        115068         23718        113059         23716
BM_generic<TestZeroing, char>/16777216/real_time                                      -0.7239         -0.7182        149876         41380        146765         41362
BM_generic<TestZeroing, char>/134217728/real_time                                     -0.5923         -0.5877        438587        178797        433499        178748
BM_generic<TestZeroing, int>/1024/real_time                                           -0.6154         -0.6078         43100         16577         42258         16572
BM_generic<TestZeroing, int>/4096/real_time                                           -0.5199         -0.5117         44158         21200         43405         21196
BM_generic<TestZeroing, int>/32768/real_time                                          -0.7291         -0.7235         69845         18923         68409         18917
BM_generic<TestZeroing, int>/262144/real_time                                         -0.8312         -0.8283        112796         19043        110872         19041
BM_generic<TestZeroing, int>/2097152/real_time                                        -0.7667         -0.7616        119559         27898        117023         27898
BM_generic<TestZeroing, int>/16777216/real_time                                       -0.4928         -0.4828        189309         96024        185667         96024
BM_generic<TestZeroing, int>/134217728/real_time                                      -0.2545         -0.2537        845958        630673        844843        630542
BM_generic<TestZeroing, double>/1024/real_time                                        -0.5447         -0.5347         41199         18756         40309         18756
BM_generic<TestZeroing, double>/4096/real_time                                        -0.5485         -0.5413         42197         19052         41532         19051
BM_generic<TestZeroing, double>/32768/real_time                                       -0.7146         -0.7088         67167         19169         65810         19167
BM_generic<TestZeroing, double>/262144/real_time                                      -0.8085         -0.8046        108622         20800        106459         20799
BM_generic<TestZeroing, double>/2097152/real_time                                     -0.6933         -0.6873        123969         38017        121591         38016
BM_generic<TestZeroing, double>/16777216/real_time                                    -0.3538         -0.3435        267178        172640        262935        172624
BM_generic<TestZeroing, double>/134217728/real_time                                   -0.1974         -0.1963       1548664       1242915       1546500       1242881
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                          -0.5189         -0.5106         40587         19528         39902         19528
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                          -0.5585         -0.5521         44032         19440         43405         19440
BM_generic<TestZeroing, std::complex<double>>/32768/real_time                         -0.7607         -0.7572         81302         19459         80152         19457
BM_generic<TestZeroing, std::complex<double>>/262144/real_time                        -0.8498         -0.8477        145198         21814        143215         21814
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time                       -0.7640         -0.7595        245294         57901        240730         57900
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time                      -0.6974         -0.6974       1086969        328940       1086947        328940
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time                     -0.6384         -0.6389       7089023       2563219       7089059       2560119
BM_generic<TestZeroingAsync, char>/1024/real_time                                     -0.5571         -0.5508         40352         17874         39774         17866
BM_generic<TestZeroingAsync, char>/4096/real_time                                     -0.5850         -0.5784         41360         17165         40703         17160
BM_generic<TestZeroingAsync, char>/32768/real_time                                    -0.6547         -0.6481         66246         22873         64971         22867
BM_generic<TestZeroingAsync, char>/262144/real_time                                   -0.7896         -0.7862        109315         22997        107513         22990
BM_generic<TestZeroingAsync, char>/2097152/real_time                                  -0.7870         -0.7834        115327         24568        113372         24558
BM_generic<TestZeroingAsync, char>/16777216/real_time                                 -0.7295         -0.7248        155976         42191        153233         42175
BM_generic<TestZeroingAsync, char>/134217728/real_time                                -0.6128         -0.6073        458962        177719        452446        177663
BM_generic<TestZeroingAsync, int>/1024/real_time                                      -0.6193         -0.6128         43357         16508         42620         16501
BM_generic<TestZeroingAsync, int>/4096/real_time                                      -0.4914         -0.4832         40879         20793         40230         20790
BM_generic<TestZeroingAsync, int>/32768/real_time                                     -0.6855         -0.6797         65607         20635         64417         20632
BM_generic<TestZeroingAsync, int>/262144/real_time                                    -0.8154         -0.8122        112612         20788        110695         20785
BM_generic<TestZeroingAsync, int>/2097152/real_time                                   -0.7568         -0.7514        124023         30165        121309         30159
BM_generic<TestZeroingAsync, int>/16777216/real_time                                  -0.4927         -0.4832        191694         97237        188059         97192
BM_generic<TestZeroingAsync, int>/134217728/real_time                                 -0.2830         -0.2831        876644        628539        876502        628344
BM_generic<TestZeroingAsync, double>/1024/real_time                                   -0.4668         -0.4596         39576         21102         38901         21023
BM_generic<TestZeroingAsync, double>/4096/real_time                                   -0.5074         -0.4969         42439         20906         41541         20901
BM_generic<TestZeroingAsync, double>/32768/real_time                                  -0.6850         -0.6786         66344         20901         64906         20860
BM_generic<TestZeroingAsync, double>/262144/real_time                                 -0.7944         -0.7899        108045         22214        105684         22206
BM_generic<TestZeroingAsync, double>/2097152/real_time                                -0.6829         -0.6775        125385         39760        123238         39748
BM_generic<TestZeroingAsync, double>/16777216/real_time                               -0.3523         -0.3424        268066        173625        263955        173566
BM_generic<TestZeroingAsync, double>/134217728/real_time                              -0.2575         -0.2578       1664325       1235767       1664324       1235336
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time                     -0.4673         -0.4586         38619         20571         37979         20563
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time                     -0.4922         -0.4836         40499         20564         39817         20561
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time                    -0.7386         -0.7339         78832         20603         77420         20601
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time                   -0.8326         -0.8297        140202         23475        137815         23472
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time                  -0.7524         -0.7472        239297         59260        234300         59242
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time                 -0.7079         -0.7079       1122886        327954       1122372        327816
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time                -0.6468         -0.6469       7055010       2491622       7055034       2491212
BM_generic<TestOne, char>/1024/real_time                                              -0.0153         -0.0184         40713         40089         40044         39309
BM_generic<TestOne, char>/4096/real_time                                              +0.0042         -0.0025         42916         43094         42402         42298
BM_generic<TestOne, char>/32768/real_time                                             +0.0350         +0.0295         65894         68198         64861         66772
BM_generic<TestOne, char>/262144/real_time                                            +0.2160         +0.2088        103987        126447        102642        124072
BM_generic<TestOne, char>/2097152/real_time                                           +0.0449         +0.0440        103882        108548        101966        106448
BM_generic<TestOne, char>/16777216/real_time                                          +0.0427         +0.0482        139912        145893        137249        143871
BM_generic<TestOne, char>/134217728/real_time                                         +0.0159         +0.0186        431985        438841        426507        434442
BM_generic<TestOne, int>/1024/real_time                                               +0.0660         +0.0616         39363         41961         38754         41142
BM_generic<TestOne, int>/4096/real_time                                               +0.0273         +0.0275         40703         41814         39932         41031
BM_generic<TestOne, int>/32768/real_time                                              +0.0353         +0.0365         63876         66132         62667         64953
BM_generic<TestOne, int>/262144/real_time                                             +0.0634         +0.0662        103387        109937        101366        108075
BM_generic<TestOne, int>/2097152/real_time                                            -0.0009         -0.0046        114954        114848        112907        112392
BM_generic<TestOne, int>/16777216/real_time                                           +0.0208         +0.0165        182715        186522        180647        183630
BM_generic<TestOne, int>/134217728/real_time                                          -0.0729         -0.0726        918753        851795        918503        851785
BM_generic<TestOne, double>/1024/real_time                                            -0.0001         +0.0029         39456         39453         38550         38662
BM_generic<TestOne, double>/4096/real_time                                            +0.0175         +0.0205         40781         41496         39866         40684
BM_generic<TestOne, double>/32768/real_time                                           +0.0189         +0.0205         63815         65018         62570         63851
BM_generic<TestOne, double>/262144/real_time                                          +0.0096         +0.0098        103358        104355        101100        102092
BM_generic<TestOne, double>/2097152/real_time                                         +0.0039         +0.0057        121291        121768        119241        119920
BM_generic<TestOne, double>/16777216/real_time                                        -0.0055         -0.0051        264461        262995        260249        258920
BM_generic<TestOne, double>/134217728/real_time                                       -0.0556         -0.0556       1612648       1522998       1612648       1522998
BM_generic<TestOne, std::complex<double>>/1024/real_time                              -0.0134         -0.0118         42198         41634         41231         40746
BM_generic<TestOne, std::complex<double>>/4096/real_time                              -0.0721         -0.0681         47450         44029         46614         43439
BM_generic<TestOne, std::complex<double>>/32768/real_time                             -0.0909         -0.0865         88607         80555         86572         79084
BM_generic<TestOne, std::complex<double>>/262144/real_time                            -0.2303         -0.2274        154711        119087        151338        116929
BM_generic<TestOne, std::complex<double>>/2097152/real_time                           -0.0840         -0.0819        247780        226977        242984        223089
BM_generic<TestOne, std::complex<double>>/16777216/real_time                          -0.1067         -0.1068       1177550       1051892       1177525       1051819
BM_generic<TestOne, std::complex<double>>/134217728/real_time                         -0.0147         -0.0147       7025655       6922453       7025660       6922295
BM_generic<TestOneAsync, char>/1024/real_time                                         -0.1010         -0.0970         42646         38338         41574         37542
BM_generic<TestOneAsync, char>/4096/real_time                                         -0.1044         -0.1023         43951         39362         43062         38655
BM_generic<TestOneAsync, char>/32768/real_time                                        -0.0756         -0.0643         68733         63540         67081         62766
BM_generic<TestOneAsync, char>/262144/real_time                                       -0.0769         -0.0801        112611        103956        111265        102356
BM_generic<TestOneAsync, char>/2097152/real_time                                      -0.0565         -0.0536        111877        105559        109512        103642
BM_generic<TestOneAsync, char>/16777216/real_time                                     +0.0154         +0.0223        142065        144252        139099        142199
BM_generic<TestOneAsync, char>/134217728/real_time                                    +0.0427         +0.0417        430106        448455        424542        442234
BM_generic<TestOneAsync, int>/1024/real_time                                          +0.0291         +0.0265         38607         39732         38029         39036
BM_generic<TestOneAsync, int>/4096/real_time                                          -0.0188         -0.0134         40149         39394         39357         38830
BM_generic<TestOneAsync, int>/32768/real_time                                         +0.0053         -0.0003         63922         64258         62788         62770
BM_generic<TestOneAsync, int>/262144/real_time                                        -0.0180         -0.0278        108280        106330        106843        103877
BM_generic<TestOneAsync, int>/2097152/real_time                                       -0.0564         -0.0611        120231        113449        118283        111058
BM_generic<TestOneAsync, int>/16777216/real_time                                      -0.0065         -0.0075        185626        184422        182657        181293
BM_generic<TestOneAsync, int>/134217728/real_time                                     -0.0034         -0.0034        886893        883894        886866        883894
BM_generic<TestOneAsync, double>/1024/real_time                                       -0.0809         -0.0793         43038         39555         41996         38666
BM_generic<TestOneAsync, double>/4096/real_time                                       -0.0499         -0.0561         43314         41151         42518         40133
BM_generic<TestOneAsync, double>/32768/real_time                                      -0.0103         -0.0118         64772         64107         63743         62992
BM_generic<TestOneAsync, double>/262144/real_time                                     +0.0208         +0.0259        105122        107307        102914        105576
BM_generic<TestOneAsync, double>/2097152/real_time                                    +0.0429         +0.0468        121276        126482        119101        124672
BM_generic<TestOneAsync, double>/16777216/real_time                                   +0.0042         +0.0039        270667        271797        266458        267505
BM_generic<TestOneAsync, double>/134217728/real_time                                  -0.0536         -0.0549       1650496       1562085       1650496       1559858
BM_generic<TestOneAsync, std::complex<double>>/1024/real_time                         -0.0179         -0.0129         40301         39580         39410         38903
BM_generic<TestOneAsync, std::complex<double>>/4096/real_time                         +0.0294         +0.0295         42534         43786         41764         42997
BM_generic<TestOneAsync, std::complex<double>>/32768/real_time                        +0.0551         +0.0676         81431         85919         79547         84923
BM_generic<TestOneAsync, std::complex<double>>/262144/real_time                       +0.1183         +0.1198        148593        166169        145146        162537
BM_generic<TestOneAsync, std::complex<double>>/2097152/real_time                      +0.0510         +0.0537        248783        261474        244141        257254
BM_generic<TestOneAsync, std::complex<double>>/16777216/real_time                     +0.0266         +0.0252       1085237       1114122       1085236       1112604
BM_generic<TestOneAsync, std::complex<double>>/134217728/real_time                    +0.0259         +0.0244       6878552       7056948       6878197       7045838

@masterleinad
Contributor Author

HIP

Comparing hip_old.json to hip_new.json
Benchmark                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_generic<TestZeroing, char>/1024/real_time                                          -0.0552         -0.0549         30490         28806         30462         28790
BM_generic<TestZeroing, char>/4096/real_time                                          +0.0280         +0.0279         25451         26164         25445         26155
BM_generic<TestZeroing, char>/32768/real_time                                         +0.1586         +0.1585         20238         23447         20235         23442
BM_generic<TestZeroing, char>/262144/real_time                                        -0.2036         -0.2035         34261         27286         34252         27282
BM_generic<TestZeroing, char>/2097152/real_time                                       -0.5235         -0.5236         61871         29482         61858         29471
BM_generic<TestZeroing, char>/16777216/real_time                                      -0.8065         -0.8065        268170         51886        268116         51877
BM_generic<TestZeroing, char>/134217728/real_time                                     -0.8601         -0.8601       1801869        252026       1801555        251977
BM_generic<TestZeroing, int>/1024/real_time                                           -0.2826         -0.2823         31022         22256         30999         22248
BM_generic<TestZeroing, int>/4096/real_time                                           -0.3138         -0.3135         32458         22272         32440         22268
BM_generic<TestZeroing, int>/32768/real_time                                          +0.2720         +0.2720         20771         26421         20767         26416
BM_generic<TestZeroing, int>/262144/real_time                                         +0.2354         +0.2353         24710         30526         24703         30516
BM_generic<TestZeroing, int>/2097152/real_time                                        -0.2019         -0.2020         47844         38186         47838         38175
BM_generic<TestZeroing, int>/16777216/real_time                                       -0.4907         -0.4907        262767        133828        262704        133808
BM_generic<TestZeroing, int>/134217728/real_time                                      -0.4955         -0.4955       1807420        911833       1807036        911706
BM_generic<TestZeroing, double>/1024/real_time                                        +0.2281         +0.2281         18215         22371         18209         22364
BM_generic<TestZeroing, double>/4096/real_time                                        +0.2747         +0.2747         18156         23143         18153         23140
BM_generic<TestZeroing, double>/32768/real_time                                       +0.2913         +0.2915         20992         27108         20986         27103
BM_generic<TestZeroing, double>/262144/real_time                                      -0.0883         -0.0882         32074         29243         32067         29238
BM_generic<TestZeroing, double>/2097152/real_time                                     -0.1480         -0.1480         60743         51751         60729         51738
BM_generic<TestZeroing, double>/16777216/real_time                                    -0.0377         -0.0376        261939        252076        261901        252043
BM_generic<TestZeroing, double>/134217728/real_time                                   -0.0053         -0.0053       1807058       1797515       1806767       1797253
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                          -0.1968         -0.1967         28083         22557         28077         22554
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                          -0.1166         -0.1166         27943         24685         27939         24681
BM_generic<TestZeroing, std::complex<double>>/32768/real_time                         +0.0444         +0.0444         31016         32393         31011         32388
BM_generic<TestZeroing, std::complex<double>>/262144/real_time                        +0.0495         +0.0495         34744         36465         34739         36459
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time                       +0.2151         +0.2150         69073         83928         69064         83916
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time                      +0.3024         +0.3024        369454        481168        369389        481101
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time                     +0.3401         +0.3401       2655439       3558528       2655066       3558008
BM_generic<TestZeroingAsync, char>/1024/real_time                                     +0.0315         +0.0313         23006         23730         22997         23717
BM_generic<TestZeroingAsync, char>/4096/real_time                                     +0.0684         +0.0684         22860         24425         22855         24419
BM_generic<TestZeroingAsync, char>/32768/real_time                                    +0.0007         +0.0008         25066         25083         25060         25080
BM_generic<TestZeroingAsync, char>/262144/real_time                                   +0.0031         +0.0032         28078         28164         28069         28158
BM_generic<TestZeroingAsync, char>/2097152/real_time                                  -0.4803         -0.4803         52809         27445         52802         27440
BM_generic<TestZeroingAsync, char>/16777216/real_time                                 -0.8016         -0.8017        253680         50324        253646         50306
BM_generic<TestZeroingAsync, char>/134217728/real_time                                -0.8582         -0.8583       1794901        254441       1794647        254386
BM_generic<TestZeroingAsync, int>/1024/real_time                                      -0.0823         -0.0823         22702         20834         22697         20829
BM_generic<TestZeroingAsync, int>/4096/real_time                                      -0.0860         -0.0859         22744         20788         22737         20782
BM_generic<TestZeroingAsync, int>/32768/real_time                                     +0.0010         +0.0006         25020         25046         25014         25029
BM_generic<TestZeroingAsync, int>/262144/real_time                                    +0.0105         +0.0105         28702         29004         28696         28997
BM_generic<TestZeroingAsync, int>/2097152/real_time                                   -0.3055         -0.3046         52732         36624         52656         36618
BM_generic<TestZeroingAsync, int>/16777216/real_time                                  -0.4792         -0.4792        254367        132476        254334        132456
BM_generic<TestZeroingAsync, int>/134217728/real_time                                 -0.4894         -0.4894       1794472        916269       1794236        916128
BM_generic<TestZeroingAsync, double>/1024/real_time                                   -0.0919         -0.0919         22730         20641         22726         20637
BM_generic<TestZeroingAsync, double>/4096/real_time                                   -0.0701         -0.0701         22664         21074         22658         21070
BM_generic<TestZeroingAsync, double>/32768/real_time                                  -0.0045         -0.0044         25293         25180         25287         25176
BM_generic<TestZeroingAsync, double>/262144/real_time                                 -0.0594         -0.0578         29222         27487         29168         27482
BM_generic<TestZeroingAsync, double>/2097152/real_time                                -0.0468         -0.0465         52871         50399         52840         50385
BM_generic<TestZeroingAsync, double>/16777216/real_time                               -0.0042         -0.0042        257189        256107        257142        256072
BM_generic<TestZeroingAsync, double>/134217728/real_time                              -0.0174         -0.0174       1830198       1798370       1829934       1798049
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time                     -0.1042         -0.1043         23067         20663         23063         20657
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time                     +0.0371         +0.0367         23019         23873         23016         23861
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time                    +0.1224         +0.1220         25940         29115         25935         29099
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time                   -0.0018         -0.0017         29828         29774         29819         29767
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time                  +0.2025         +0.2026         64455         77505         64440         77493
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time                 +0.3305         +0.3305        359666        478546        359617        478466
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time                +0.3387         +0.3387       2653461       3552224       2653084       3551627
BM_generic<TestOne, char>/1024/real_time                                              +0.0134         +0.0143         23432         23746         23408         23743
BM_generic<TestOne, char>/4096/real_time                                              +0.0114         +0.0123         23340         23605         23316         23602
BM_generic<TestOne, char>/32768/real_time                                             +0.0008         +0.0014         25739         25759         25719         25755
BM_generic<TestOne, char>/262144/real_time                                            +0.0414         +0.0420         28357         29529         28335         29525
BM_generic<TestOne, char>/2097152/real_time                                           +0.0025         +0.0029         53171         53304         53142         53297
BM_generic<TestOne, char>/16777216/real_time                                          +0.0068         +0.0068        257081        258827        257047        258792
BM_generic<TestOne, char>/134217728/real_time                                         +0.0096         +0.0096       1791589       1808857       1791332       1808511
BM_generic<TestOne, int>/1024/real_time                                               +0.0595         +0.0603         23224         24606         23204         24603
BM_generic<TestOne, int>/4096/real_time                                               +0.0706         +0.0713         23129         24762         23111         24758
BM_generic<TestOne, int>/32768/real_time                                              +0.0572         +0.0579         25338         26787         25317         26784
BM_generic<TestOne, int>/262144/real_time                                             +0.0400         +0.0408         29223         30393         29197         30388
BM_generic<TestOne, int>/2097152/real_time                                            +0.0223         +0.0252         52981         54163         52822         54155
BM_generic<TestOne, int>/16777216/real_time                                           +0.0131         +0.0131        255191        258531        255146        258498
BM_generic<TestOne, int>/134217728/real_time                                          +0.0010         +0.0010       1793878       1795627       1793639       1795387
BM_generic<TestOne, double>/1024/real_time                                            +0.0426         +0.0434         23425         24422         23401         24417
BM_generic<TestOne, double>/4096/real_time                                            -0.1495         -0.1489         23271         19791         23251         19788
BM_generic<TestOne, double>/32768/real_time                                           -0.1964         -0.1960         25721         20668         25703         20665
BM_generic<TestOne, double>/262144/real_time                                          -0.1724         -0.1687         29687         24569         29551         24566
BM_generic<TestOne, double>/2097152/real_time                                         -0.0717         -0.0704         56862         52788         56770         52771
BM_generic<TestOne, double>/16777216/real_time                                        -0.0508         -0.0508        268966        255292        268927        255260
BM_generic<TestOne, double>/134217728/real_time                                       -0.0069         -0.0069       1813411       1800857       1813068       1800637
BM_generic<TestOne, std::complex<double>>/1024/real_time                              -0.1149         -0.1149         33630         29767         33623         29759
BM_generic<TestOne, std::complex<double>>/4096/real_time                              -0.3257         -0.3256         27443         18505         27438         18503
BM_generic<TestOne, std::complex<double>>/32768/real_time                             -0.0755         -0.0756         27956         25845         27952         25839
BM_generic<TestOne, std::complex<double>>/262144/real_time                            -0.0527         -0.0527         33370         31611         33365         31606
BM_generic<TestOne, std::complex<double>>/2097152/real_time                           -0.0408         -0.0408         68043         65267         68034         65258
BM_generic<TestOne, std::complex<double>>/16777216/real_time                          -0.0049         -0.0049        366422        364633        366376        364575
BM_generic<TestOne, std::complex<double>>/134217728/real_time                         -0.0008         -0.0009       2659118       2656910       2658750       2656471
BM_generic<TestOneAsync, char>/1024/real_time                                         -0.0183         -0.0183         22484         22074         22481         22070
BM_generic<TestOneAsync, char>/4096/real_time                                         -0.0019         -0.0019         21986         21945         21983         21942
BM_generic<TestOneAsync, char>/32768/real_time                                        -0.0002         -0.0002         24083         24078         24080         24075
BM_generic<TestOneAsync, char>/262144/real_time                                       -0.0161         -0.0161         27521         27077         27517         27074
BM_generic<TestOneAsync, char>/2097152/real_time                                      -0.0268         -0.0268         52376         50971         52368         50964
BM_generic<TestOneAsync, char>/16777216/real_time                                     +0.0010         +0.0010        257225        257491        257191        257456
BM_generic<TestOneAsync, char>/134217728/real_time                                    -0.0025         -0.0025       1798052       1793590       1797808       1793266
BM_generic<TestOneAsync, int>/1024/real_time                                          -0.0337         -0.0336         22380         21627         22377         21624
BM_generic<TestOneAsync, int>/4096/real_time                                          -0.0216         -0.0216         22259         21779         22256         21775
BM_generic<TestOneAsync, int>/32768/real_time                                         -0.2553         -0.2552         32427         24150         32422         24147
BM_generic<TestOneAsync, int>/262144/real_time                                        +0.1213         +0.1213         24711         27709         24708         27705
BM_generic<TestOneAsync, int>/2097152/real_time                                       -0.1199         -0.1198         57727         50806         57711         50799
BM_generic<TestOneAsync, int>/16777216/real_time                                      -0.0064         -0.0064        258715        257052        258679        257019
BM_generic<TestOneAsync, int>/134217728/real_time                                     -0.0109         -0.0108       1814712       1794951       1814364       1794713
BM_generic<TestOneAsync, double>/1024/real_time                                       -0.0296         -0.0296         22685         22012         22682         22009
BM_generic<TestOneAsync, double>/4096/real_time                                       -0.0270         -0.0270         22479         21871         22475         21869
BM_generic<TestOneAsync, double>/32768/real_time                                      -0.0632         -0.0632         25333         23732         25329         23729
BM_generic<TestOneAsync, double>/262144/real_time                                     -0.0266         -0.0266         29139         28363         29135         28359
BM_generic<TestOneAsync, double>/2097152/real_time                                    -0.0283         -0.0283         52938         51441         52931         51433
BM_generic<TestOneAsync, double>/16777216/real_time                                   -0.0252         -0.0252        261210        254639        261176        254606
BM_generic<TestOneAsync, double>/134217728/real_time                                  -0.0190         -0.0190       1842843       1807830       1842584       1807585
BM_generic<TestOneAsync, std::complex<double>>/1024/real_time                         +0.2863         +0.2862         23002         29587         22999         29582
BM_generic<TestOneAsync, std::complex<double>>/4096/real_time                         +0.1272         +0.1272         22665         25549         22662         25544
BM_generic<TestOneAsync, std::complex<double>>/32768/real_time                        +0.1439         +0.1437         25623         29309         25619         29300
BM_generic<TestOneAsync, std::complex<double>>/262144/real_time                       +0.0861         +0.0861         25410         27597         25405         27594
BM_generic<TestOneAsync, std::complex<double>>/2097152/real_time                      +0.0393         +0.0393         69041         71755         69030         71743
BM_generic<TestOneAsync, std::complex<double>>/16777216/real_time                     +0.0120         +0.0120        366649        371040        366599        370991
BM_generic<TestOneAsync, std::complex<double>>/134217728/real_time                    +0.0016         +0.0016       2657338       2661565       2656989       2661112
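
For reading these comparison tables: the Time and CPU columns appear to be the relative change (new - old) / old, so negative values mean the new code is faster. A minimal sketch of that arithmetic, checked against the `BM_generic<TestZeroing, char>/16777216` HIP row above (the helper names `relative_change` and `speedup` are made up for illustration and are not part of the benchmark tooling):

```cpp
#include <cassert>

// Relative change as it appears in the "Time" column: (new - old) / old.
// A negative result means the new run took less time than the old one.
inline double relative_change(double time_old, double time_new) {
  assert(time_old > 0.0);
  return (time_new - time_old) / time_old;
}

// Speedup factor old / new; values above 1 mean the new run is faster.
inline double speedup(double time_old, double time_new) {
  assert(time_new > 0.0);
  return time_old / time_new;
}
```

With Time Old = 268170 and Time New = 51886, `relative_change` gives about -0.8065, which corresponds to a speedup of a bit more than 5x.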

@masterleinad
Contributor Author

I think this is now ready to be reviewed and discussed, in particular whether the results are good enough.

@masterleinad changed the title from "[WIP] Use cudaMemset in deep_copy" to "Use cudaMemset in deep_copy" on Apr 16, 2021
@masterleinad
Contributor Author

I couldn't find any OpenMPTarget memset call and didn't see any benefit on the host so I only implemented it for SYCL, HIP, and CUDA.

@masterleinad
Contributor Author

The benchmark program was:

#include <Kokkos_Core.hpp>
#include <benchmark/benchmark.h>

#include <complex>

struct TestZeroing
{
  template <class ExecutionSpace, class View>
  TestZeroing(ExecutionSpace const &s, View x) {
    run(s, x);
  }

  template <class ExecutionSpace, class View>
  void run(ExecutionSpace const &, View x) {
    Kokkos::deep_copy(x, 0);
  }
};

struct TestZeroingAsync
{
  template <class ExecutionSpace, class View>
  TestZeroingAsync(ExecutionSpace const &s, View x) {
    run(s, x);
  }

  template <class ExecutionSpace, class View>
  void run(ExecutionSpace const &s, View x) {
    Kokkos::deep_copy(s, x, 0);
    s.fence();
  }
};

struct TestOne
{
  template <class ExecutionSpace, class View>
  TestOne(ExecutionSpace const &s, View x) {
    run(s, x);
  }

  template <class ExecutionSpace, class View>
  void run(ExecutionSpace const &, View x) {
    Kokkos::deep_copy(x, 1);
  }
};

struct TestOneAsync
{
  template <class ExecutionSpace, class View>
  TestOneAsync(ExecutionSpace const &s, View x) {
    run(s, x);
  }

  template <class ExecutionSpace, class View>
  void run(ExecutionSpace const &s, View x) {
    Kokkos::deep_copy(s, x, 1);
    s.fence();
  }
};

template <class K, typename T>
void BM_generic(benchmark::State &state) {
#if defined(KOKKOS_ENABLE_HIP)
  using ExecutionSpace = Kokkos::Experimental::HIP;
#elif defined(KOKKOS_ENABLE_CUDA)
  using ExecutionSpace = Kokkos::Cuda;
#elif defined(KOKKOS_ENABLE_OPENMPTARGET)
  using ExecutionSpace = Kokkos::Experimental::OpenMPTarget;
#else
  using ExecutionSpace = Kokkos::Experimental::SYCL;
#endif
  int n = state.range(0);
  ExecutionSpace space{};
  Kokkos::View<T *, ExecutionSpace> x("x", n);
  K(space, x);  // warm-up
  for (auto _ : state) {
    K(space, x);
  }
  state.counters["Bandwidth"] =
      benchmark::Counter(sizeof(T) * n,
                         benchmark::Counter::kIsIterationInvariantRate);
}
#define REGISTER_BENCHMARK(KERNEL, TYPE)       \
  BENCHMARK_TEMPLATE(BM_generic, KERNEL, TYPE) \
      ->RangeMultiplier(8)                     \
      ->Range(1024, 8 << 24)                   \
      ->UseRealTime()

REGISTER_BENCHMARK(TestZeroing, char);
REGISTER_BENCHMARK(TestZeroing, int);
REGISTER_BENCHMARK(TestZeroing, double);
REGISTER_BENCHMARK(TestZeroing, std::complex<double>);
REGISTER_BENCHMARK(TestZeroingAsync, char);
REGISTER_BENCHMARK(TestZeroingAsync, int);
REGISTER_BENCHMARK(TestZeroingAsync, double);
REGISTER_BENCHMARK(TestZeroingAsync, std::complex<double>);
REGISTER_BENCHMARK(TestOne, char);
REGISTER_BENCHMARK(TestOne, int);
REGISTER_BENCHMARK(TestOne, double);
REGISTER_BENCHMARK(TestOne, std::complex<double>);
REGISTER_BENCHMARK(TestOneAsync, char);
REGISTER_BENCHMARK(TestOneAsync, int);
REGISTER_BENCHMARK(TestOneAsync, double);
REGISTER_BENCHMARK(TestOneAsync, std::complex<double>);

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  benchmark::Initialize(&argc, argv);
  benchmark::RunSpecifiedBenchmarks();
  Kokkos::finalize();
}
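
The Bandwidth counter in `BM_generic` reports bytes written per second: with `kIsIterationInvariantRate`, the counter value `sizeof(T) * n` is multiplied by the iteration count and divided by the elapsed time. A sketch of the same arithmetic for a single iteration, assuming the reported times are in nanoseconds (the Google Benchmark default); the helper name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Bytes per second written by one fill of n elements of elem_size bytes
// each, given that the fill took time_ns nanoseconds.
inline double bandwidth_bytes_per_second(std::size_t elem_size, std::size_t n,
                                         double time_ns) {
  assert(time_ns > 0.0);
  const double bytes = static_cast<double>(elem_size) * static_cast<double>(n);
  return bytes / (time_ns * 1e-9);
}
```

For example, the new HIP time of 252026 for `TestZeroing, char` at 134217728 elements works out to roughly 530 GB/s under that assumption.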

@crtrott
Member

crtrott commented Apr 16, 2021

Hm, looks like for HIP right now it's not a win generally.

@crtrott
Member

crtrott commented Apr 16, 2021

Still probably good enough

core/src/Kokkos_CopyViews.hpp
}

template <typename ExecutionSpace, class DT, class... DP>
inline void plain_memcpy(
Member

"Plain memcpy" does not describe what this function does.

Suggested change
- inline void plain_memcpy(
+ void copy_assign_value(

(must find a more appropriate name)

Contributor Author

The point I was trying to make here is that we know the buffer to be filled is contiguous in memory. What about contiguous_fill or similar?

Contributor Author

... or flat_fill since we are calling ViewFill<ViewTypeFlat,... in the end.

Member

contiguous_fill sounds good to me.

}

template <class DT, class... DP>
inline void memset(const View<DT, DP...>& dst,
Member

Suggested change
- inline void memset(const View<DT, DP...>& dst,
+ void fill(const View<DT, DP...>& dst,

(must fix)

Contributor Author

I think the choice here depends a little on the one for plain_memcpy. This function does essentially the same thing: it calls plain_memcpy but has a special case for the zero_byte_case.
I don't like fill since it doesn't say that the underlying memory must be contiguous. memset would say that but is too strong.
What about flat_fill_or_memset here?
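
To make the naming question concrete, here is a standalone host-side sketch of the behavior being described: use memset when the fill value consists only of zero bytes, and fall back to an element-wise fill over the contiguous range otherwise. This is plain C++ and not the code in this PR; `flat_fill_or_memset` takes a raw pointer here instead of a View, and `is_all_zero_bytes` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// True if every byte of the object representation of value is zero.
template <class T>
bool is_all_zero_bytes(const T& value) {
  static_assert(std::is_trivially_copyable<T>::value,
                "byte-wise inspection requires a trivially copyable type");
  unsigned char bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  return std::all_of(bytes, bytes + sizeof(T),
                     [](unsigned char b) { return b == 0; });
}

// Fill n contiguous elements starting at dst with value.
// Returns true if the memset path was taken, false otherwise.
template <class T>
bool flat_fill_or_memset(T* dst, std::size_t n, const T& value) {
  assert(dst != nullptr || n == 0);
  if (n == 0) return false;  // nothing to write
  if (is_all_zero_bytes(value)) {
    std::memset(dst, 0, n * sizeof(T));  // zero-byte special case
    return true;
  }
  std::fill(dst, dst + n, value);  // generic element-wise fill
  return false;
}
```

Note that in this sketch a value such as -0.0 takes the generic path on IEEE 754 platforms, because its sign bit makes the byte pattern non-zero.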

@masterleinad
Contributor Author

Results for Serial

Benchmark                                                                                Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_generic<TestZeroing, char>/1024/real_time                                          -0.9020         -0.9020          4887           479          4876           478
BM_generic<TestZeroing, char>/4096/real_time                                          -0.9708         -0.9708         17147           501         17106           500
BM_generic<TestZeroing, char>/32768/real_time                                         -0.9931         -0.9931        131681           903        131368           901
BM_generic<TestZeroing, char>/262144/real_time                                        -0.9962         -0.9962       1048423          3941       1045924          3932
BM_generic<TestZeroing, char>/2097152/real_time                                       -0.9935         -0.9935       8391222         54671       8371255         54541
BM_generic<TestZeroing, char>/16777216/real_time                                      -0.9906         -0.9906      67296935        629641      67136803        628139
BM_generic<TestZeroing, char>/134217728/real_time                                     -0.9666         -0.9666     536763441      17940294     535485793      17897584
BM_generic<TestZeroing, int>/1024/real_time                                           -0.8998         -0.8998          4942           495          4930           494
BM_generic<TestZeroing, int>/4096/real_time                                           -0.9681         -0.9681         17184           549         17144           548
BM_generic<TestZeroing, int>/32768/real_time                                          -0.9830         -0.9830        130980          2232        130668          2227
BM_generic<TestZeroing, int>/262144/real_time                                         -0.9830         -0.9830       1048339         17844       1045845         17801
BM_generic<TestZeroing, int>/2097152/real_time                                        -0.9664         -0.9664       8405923        282584       8385856        281912
BM_generic<TestZeroing, int>/16777216/real_time                                       -0.8707         -0.8707      68906088       8911459      68742158       8890250
BM_generic<TestZeroing, int>/134217728/real_time                                      -0.8701         -0.8701     555061138      72097790     553740220      71926229
BM_generic<TestZeroing, double>/1024/real_time                                        -0.8954         -0.8954          4937           517          4926           515
BM_generic<TestZeroing, double>/4096/real_time                                        -0.9473         -0.9473         17102           901         17062           899
BM_generic<TestZeroing, double>/32768/real_time                                       -0.9700         -0.9700        131206          3939        130894          3930
BM_generic<TestZeroing, double>/262144/real_time                                      -0.9481         -0.9481       1054388         54733       1051879         54603
BM_generic<TestZeroing, double>/2097152/real_time                                     -0.9286         -0.9286       8424736        601714       8404693        600283
BM_generic<TestZeroing, double>/16777216/real_time                                    -0.7537         -0.7537      72694836      17901660      72521320      17859075
BM_generic<TestZeroing, double>/134217728/real_time                                   -0.7516         -0.7516     582511291     144677239     581125312     144332884
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                          -0.8996         -0.8996          5540           556          5527           555
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                          -0.9302         -0.9302         19606          1368         19559          1365
BM_generic<TestZeroing, std::complex<double>>/32768/real_time                         -0.9498         -0.9498        148146          7433        147793          7415
BM_generic<TestZeroing, std::complex<double>>/262144/real_time                        -0.8913         -0.8913       1198123        130244       1195273        129934
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time                       -0.6368         -0.6368      10166558       3692680      10142377       3683888
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time                      -0.5826         -0.5826      86176366      35973148      85971321      35887494
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time                     -0.5825         -0.5825     693418424     289484171     691768256     288791217
BM_generic<TestZeroingAsync, char>/1024/real_time                                     -0.9674         -0.9674          4570           149          4559           149
BM_generic<TestZeroingAsync, char>/4096/real_time                                     -0.9903         -0.9903         17230           167         17189           167
BM_generic<TestZeroingAsync, char>/32768/real_time                                    -0.9960         -0.9960        131287           528        130975           527
BM_generic<TestZeroingAsync, char>/262144/real_time                                   -0.9965         -0.9965       1043180          3625       1040698          3617
BM_generic<TestZeroingAsync, char>/2097152/real_time                                  -0.9935         -0.9935       8417188         54352       8397156         54223
BM_generic<TestZeroingAsync, char>/16777216/real_time                                 -0.9910         -0.9910      66788165        600959      66629254        599530
BM_generic<TestZeroingAsync, char>/134217728/real_time                                -0.9664         -0.9664     533407075      17940089     532137570      17897383
BM_generic<TestZeroingAsync, int>/1024/real_time                                      -0.9648         -0.9648          4723           166          4712           166
BM_generic<TestZeroingAsync, int>/4096/real_time                                      -0.9872         -0.9872         16777           214         16737           214
BM_generic<TestZeroingAsync, int>/32768/real_time                                     -0.9854         -0.9854        130061          1893        129752          1889
BM_generic<TestZeroingAsync, int>/262144/real_time                                    -0.9835         -0.9835       1042937         17168       1040456         17127
BM_generic<TestZeroingAsync, int>/2097152/real_time                                   -0.9659         -0.9659       8433106        287305       8413041        286620
BM_generic<TestZeroingAsync, int>/16777216/real_time                                  -0.8712         -0.8712      69356151       8930061      69191096       8908815
BM_generic<TestZeroingAsync, int>/134217728/real_time                                 -0.8705         -0.8705     556870767      72101638     555545826      71930065
BM_generic<TestZeroingAsync, double>/1024/real_time                                   -0.9595         -0.9595          4634           188          4623           187
BM_generic<TestZeroingAsync, double>/4096/real_time                                   -0.9668         -0.9668         16881           561         16841           560
BM_generic<TestZeroingAsync, double>/32768/real_time                                  -0.9725         -0.9725        130409          3592        130098          3584
BM_generic<TestZeroingAsync, double>/262144/real_time                                 -0.9487         -0.9487       1051770         54002       1049268         53873
BM_generic<TestZeroingAsync, double>/2097152/real_time                                -0.9286         -0.9286       8415675        600476       8395653        599047
BM_generic<TestZeroingAsync, double>/16777216/real_time                               -0.7521         -0.7521      72409059      17953394      72236725      17910463
BM_generic<TestZeroingAsync, double>/134217728/real_time                              -0.7542         -0.7542     587193613     144341466     585796193     143997981
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time                     -0.9572         -0.9572          5159           221          5147           220
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time                     -0.9453         -0.9453         19322          1057         19276          1054
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time                    -0.9523         -0.9523        148042          7067        147690          7051
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time                   -0.8924         -0.8924       1200127        129174       1197263        128866
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time                  -0.6525         -0.6525      10298798       3579331      10274294       3570815
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time                 -0.5787         -0.5787      85685793      36102556      85481890      36016660
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time                -0.5769         -0.5769     686537264     290485540     684903418     289794251
BM_generic<TestOne, char>/1024/real_time                                              +0.0121         +0.0121          4913          4972          4901          4961
BM_generic<TestOne, char>/4096/real_time                                              +0.0024         +0.0024         17220         17262         17179         17221
BM_generic<TestOne, char>/32768/real_time                                             +0.0078         +0.0078        131493        132522        131179        132206
BM_generic<TestOne, char>/262144/real_time                                            +0.0067         +0.0067       1050450       1057474       1047950       1054959
BM_generic<TestOne, char>/2097152/real_time                                           +0.0064         +0.0064       8394045       8448131       8374072       8428028
BM_generic<TestOne, char>/16777216/real_time                                          +0.0056         +0.0056      67257828      67632155      67097829      67471278
BM_generic<TestOne, char>/134217728/real_time                                         -0.0154         -0.0154     542576595     534227311     541280857     532956198
BM_generic<TestOne, int>/1024/real_time                                               -0.0130         -0.0130          4952          4888          4940          4876
BM_generic<TestOne, int>/4096/real_time                                               +0.0105         +0.0105         17069         17249         17029         17208
BM_generic<TestOne, int>/32768/real_time                                              +0.0166         +0.0166        129978        132135        129669        131821
BM_generic<TestOne, int>/262144/real_time                                             -0.0062         -0.0062       1052086       1045578       1049572       1043078
BM_generic<TestOne, int>/2097152/real_time                                            +0.0017         +0.0017       8350547       8364411       8330686       8344516
BM_generic<TestOne, int>/16777216/real_time                                           +0.0033         +0.0033      68808774      69039019      68645053      68874773
BM_generic<TestOne, int>/134217728/real_time                                          -0.0038         -0.0038     555068341     552943954     553747144     551628170
BM_generic<TestOne, double>/1024/real_time                                            +0.0062         +0.0062          4915          4945          4903          4933
BM_generic<TestOne, double>/4096/real_time                                            +0.0086         +0.0086         17135         17282         17094         17241
BM_generic<TestOne, double>/32768/real_time                                           +0.0058         +0.0058        130918        131675        130606        131362
BM_generic<TestOne, double>/262144/real_time                                          +0.0190         +0.0190       1053672       1073705       1051165       1071151
BM_generic<TestOne, double>/2097152/real_time                                         +0.0167         +0.0167       8429348       8570384       8409205       8549913
BM_generic<TestOne, double>/16777216/real_time                                        -0.0031         -0.0031      72898490      72674677      72724997      72501712
BM_generic<TestOne, double>/134217728/real_time                                       +0.0001         +0.0001     587412461     587461762     586014035     586063205
BM_generic<TestOne, std::complex<double>>/1024/real_time                              +0.0051         +0.0050          5530          5558          5517          5545
BM_generic<TestOne, std::complex<double>>/4096/real_time                              -0.0088         -0.0088         19522         19349         19475         19303
BM_generic<TestOne, std::complex<double>>/32768/real_time                             -0.0033         -0.0033        148283        147791        147930        147439
BM_generic<TestOne, std::complex<double>>/262144/real_time                            +0.0152         +0.0152       1200378       1218575       1197522       1215676
BM_generic<TestOne, std::complex<double>>/2097152/real_time                           -0.0134         -0.0134      10319961      10181267      10295410      10157052
BM_generic<TestOne, std::complex<double>>/16777216/real_time                          +0.0011         +0.0011      85718654      85809437      85514013      85605191
BM_generic<TestOne, std::complex<double>>/134217728/real_time                         -0.0039         -0.0039     685574510     682882417     683943150     681256795
BM_generic<TestOneAsync, char>/1024/real_time                                         -0.0097         -0.0097          4605          4560          4594          4549
BM_generic<TestOneAsync, char>/4096/real_time                                         -0.0223         -0.0223         17188         16804         17148         16764
BM_generic<TestOneAsync, char>/32768/real_time                                        -0.0111         -0.0111        131499        130040        131186        129730
BM_generic<TestOneAsync, char>/262144/real_time                                       -0.0034         -0.0034       1047166       1043605       1044675       1041122
BM_generic<TestOneAsync, char>/2097152/real_time                                      +0.0043         +0.0043       8377148       8413417       8357156       8393404
BM_generic<TestOneAsync, char>/16777216/real_time                                     +0.0033         +0.0033      66814028      67032434      66655112      66872946
BM_generic<TestOneAsync, char>/134217728/real_time                                    -0.0028         -0.0028     534943018     533459458     533669443     532189375
BM_generic<TestOneAsync, int>/1024/real_time                                          -0.0315         -0.0315          4703          4555          4692          4544
BM_generic<TestOneAsync, int>/4096/real_time                                          +0.0031         +0.0031         16630         16681         16591         16642
BM_generic<TestOneAsync, int>/32768/real_time                                         +0.0045         +0.0045        130000        130590        129691        130280
BM_generic<TestOneAsync, int>/262144/real_time                                        -0.0133         -0.0133       1055038       1041039       1052528       1038562
BM_generic<TestOneAsync, int>/2097152/real_time                                       -0.0082         -0.0082       8430929       8362095       8410868       8342203
BM_generic<TestOneAsync, int>/16777216/real_time                                      +0.0065         +0.0065      68848282      69297873      68683887      69132223
BM_generic<TestOneAsync, int>/134217728/real_time                                     -0.0159         -0.0159     559700504     550808917     558368274     549497982
BM_generic<TestOneAsync, double>/1024/real_time                                       -0.0097         -0.0097          4602          4557          4591          4547
BM_generic<TestOneAsync, double>/4096/real_time                                       -0.0140         -0.0140         17025         16787         16984         16747
BM_generic<TestOneAsync, double>/32768/real_time                                      +0.0050         +0.0050        129986        130640        129676        130330
BM_generic<TestOneAsync, double>/262144/real_time                                     -0.0077         -0.0077       1063443       1055208       1060913       1052698
BM_generic<TestOneAsync, double>/2097152/real_time                                    +0.0093         +0.0093       8427490       8506237       8407443       8486002
BM_generic<TestOneAsync, double>/16777216/real_time                                   -0.0026         -0.0026      72655028      72466537      72482067      72294118
BM_generic<TestOneAsync, double>/134217728/real_time                                  -0.0043         -0.0043     583187116     580657056     581799212     579274828
BM_generic<TestOneAsync, std::complex<double>>/1024/real_time                         -0.0067         -0.0067          5154          5120          5142          5108
BM_generic<TestOneAsync, std::complex<double>>/4096/real_time                         -0.0092         -0.0092         19249         19071         19204         19026
BM_generic<TestOneAsync, std::complex<double>>/32768/real_time                        -0.0062         -0.0062        148996        148072        148641        147719
BM_generic<TestOneAsync, std::complex<double>>/262144/real_time                       +0.0022         +0.0022       1201622       1204229       1198754       1201364
BM_generic<TestOneAsync, std::complex<double>>/2097152/real_time                      -0.0140         -0.0140      10340443      10195930      10315841      10171675
BM_generic<TestOneAsync, std::complex<double>>/16777216/real_time                     +0.0007         +0.0007      85019486      85079729      84817183      84877218
BM_generic<TestOneAsync, std::complex<double>>/134217728/real_time                    +0.0007         +0.0007     683173146     683650634     681547502     682023979

are pretty impressive.

@masterleinad
Contributor Author

@dalg24 I addressed your comments, apart from the naming ones, which I'd like to discuss some more. I also used std::memset for the Serial backend.

@masterleinad
Contributor Author

I also looked into doing the memset in parallel for OpenMP, with something like

// Total number of bytes spanned by the view (assumes contiguous storage).
size_t size = sizeof(typename View<DT, DP...>::const_value_type)*dst.size();
auto* buffer = reinterpret_cast<char*>(dst.data());
#pragma omp parallel
{
  int id = omp_get_thread_num();
  int num = omp_get_num_threads();
  // Split [0, size) into num nearly equal byte ranges, one per thread.
  size_t my_start = (id*size)/num;
  size_t my_size = ((id+1)*size)/num - my_start;
  std::memset(buffer + my_start, 0, my_size);
}

but it's not clear whether that's better than a single memset (or what a good threshold would be):

BM_generic<TestZeroing, char>/1024/real_time                                         +10.2337        +10.2332            95          1063            95          1063
BM_generic<TestZeroing, char>/4096/real_time                                          +7.8544         +7.8539           120          1060           120          1060
BM_generic<TestZeroing, char>/32768/real_time                                         +1.0276         +1.0278           560          1136           560          1136
BM_generic<TestZeroing, char>/262144/real_time                                        -0.5036         -0.5037          4105          2038          4105          2037
BM_generic<TestZeroing, char>/2097152/real_time                                       -0.7343         -0.7344         38499         10227         38499         10227
BM_generic<TestZeroing, char>/16777216/real_time                                      -0.2850         -0.2852        543939        388891        543937        388782
BM_generic<TestZeroing, char>/134217728/real_time                                     -0.1509         -0.1512       3967797       3369140       3967654       3367793
BM_generic<TestZeroing, int>/1024/real_time                                           +7.7913         +7.7909           120          1053           120          1053
BM_generic<TestZeroing, int>/4096/real_time                                           +3.7502         +3.7508           227          1077           227          1077
BM_generic<TestZeroing, int>/32768/real_time                                          -0.1424         -0.1424          2023          1735          2023          1735
BM_generic<TestZeroing, int>/262144/real_time                                         -0.7303         -0.7302         18883          5093         18880          5093
BM_generic<TestZeroing, int>/2097152/real_time                                        -0.4721         -0.4722        204305        107852        204306        107835
BM_generic<TestZeroing, int>/16777216/real_time                                       -0.1693         -0.1695       2051037       1703871       2050722       1703135
BM_generic<TestZeroing, int>/134217728/real_time                                      +0.0911         +0.0903      16340390      17828987      16340175      17815274
BM_generic<TestZeroing, double>/1024/real_time                                        +5.8647         +5.8640           154          1060           154          1060
BM_generic<TestZeroing, double>/4096/real_time                                        +0.9849         +0.9851           559          1109           559          1109
BM_generic<TestZeroing, double>/32768/real_time                                       -0.5245         -0.5245          4300          2045          4300          2045
BM_generic<TestZeroing, double>/262144/real_time                                      -0.7347         -0.7347         38615         10243         38610         10242
BM_generic<TestZeroing, double>/2097152/real_time                                     -0.3527         -0.3532        626806        405722        626808        405405
BM_generic<TestZeroing, double>/16777216/real_time                                    -0.1660         -0.1661       3991886       3329258       3991175       3328405
BM_generic<TestZeroing, double>/134217728/real_time                                   +0.0799         +0.0795      33937433      36649575      33937070      36634848
BM_generic<TestZeroing, std::complex<double>>/1024/real_time                          +3.8481         +3.8481           223          1082           223          1082
BM_generic<TestZeroing, std::complex<double>>/4096/real_time                          +0.1992         +0.1992          1024          1228          1024          1227
BM_generic<TestZeroing, std::complex<double>>/32768/real_time                         -0.6559         -0.6559          8831          3039          8831          3039
BM_generic<TestZeroing, std::complex<double>>/262144/real_time                        -0.7274         -0.7274         78090         21286         78090         21284
BM_generic<TestZeroing, std::complex<double>>/2097152/real_time                       -0.2831         -0.2834       1136611        814828       1136603        814442
BM_generic<TestZeroing, std::complex<double>>/16777216/real_time                      -0.0198         -0.0204       7840790       7685210       7839464       7679439
BM_generic<TestZeroing, std::complex<double>>/134217728/real_time                     +0.0834         +0.0827      68077994      73757266      68075081      73702412
BM_generic<TestZeroingAsync, char>/1024/real_time                                    +10.9753        +10.9777            88          1052            88          1052
BM_generic<TestZeroingAsync, char>/4096/real_time                                     +8.1460         +8.1456           115          1048           115          1048
BM_generic<TestZeroingAsync, char>/32768/real_time                                    +1.0522         +1.0520           535          1098           535          1098
BM_generic<TestZeroingAsync, char>/262144/real_time                                   -0.5129         -0.5129          4150          2021          4150          2021
BM_generic<TestZeroingAsync, char>/2097152/real_time                                  -0.7340         -0.7340         38268         10181         38268         10181
BM_generic<TestZeroingAsync, char>/16777216/real_time                                 -0.3511         -0.3513        629626        408545        629542        408390
BM_generic<TestZeroingAsync, char>/134217728/real_time                                -0.1884         -0.1887       4102184       3329200       4102192       3328226
BM_generic<TestZeroingAsync, int>/1024/real_time                                      +8.0819         +8.0812           116          1050           116          1050
BM_generic<TestZeroingAsync, int>/4096/real_time                                      +3.8830         +3.8826           219          1070           219          1070
BM_generic<TestZeroingAsync, int>/32768/real_time                                     -0.1576         -0.1576          2034          1713          2034          1713
BM_generic<TestZeroingAsync, int>/262144/real_time                                    -0.7309         -0.7309         18848          5073         18848          5072
BM_generic<TestZeroingAsync, int>/2097152/real_time                                   -0.5350         -0.5351        204505         95097        204473         95068
BM_generic<TestZeroingAsync, int>/16777216/real_time                                  -0.1317         -0.1321       1965527       1706758       1965532       1705950
BM_generic<TestZeroingAsync, int>/134217728/real_time                                 +0.0902         +0.0899      16354778      17829235      16350384      17819684
BM_generic<TestZeroingAsync, double>/1024/real_time                                   +6.0176         +6.0171           150          1055           150          1055
BM_generic<TestZeroingAsync, double>/4096/real_time                                   +1.0573         +1.0572           539          1109           539          1109
BM_generic<TestZeroingAsync, double>/32768/real_time                                  -0.5402         -0.5402          4400          2023          4399          2023
BM_generic<TestZeroingAsync, double>/262144/real_time                                 -0.7345         -0.7346         38527         10227         38527         10226
BM_generic<TestZeroingAsync, double>/2097152/real_time                                -0.3438         -0.3439        622456        408437        622372        408321
BM_generic<TestZeroingAsync, double>/16777216/real_time                               -0.1745         -0.1749       4026580       3324130       4026591       3322449
BM_generic<TestZeroingAsync, double>/134217728/real_time                              +0.0811         +0.0804      33884142      36632151      33884271      36607247
BM_generic<TestZeroingAsync, std::complex<double>>/1024/real_time                     +3.8902         +3.8897           218          1068           218          1068
BM_generic<TestZeroingAsync, std::complex<double>>/4096/real_time                     +0.1706         +0.1707          1043          1221          1043          1220
BM_generic<TestZeroingAsync, std::complex<double>>/32768/real_time                    -0.6602         -0.6603          8879          3017          8879          3016
BM_generic<TestZeroingAsync, std::complex<double>>/262144/real_time                   -0.7272         -0.7272         77914         21255         77914         21253
BM_generic<TestZeroingAsync, std::complex<double>>/2097152/real_time                  -0.3139         -0.3141       1223122        839226       1223118        838901
BM_generic<TestZeroingAsync, std::complex<double>>/16777216/real_time                 -0.0318         -0.0322       7837143       7587714       7837082       7584490
BM_generic<TestZeroingAsync, std::complex<double>>/134217728/real_time                +0.0859         +0.0857      67977581      73819855      67966571      73788709
BM_generic<TestOne, char>/1024/real_time                                              +0.0545         +0.0544          1252          1320          1252          1320
BM_generic<TestOne, char>/4096/real_time                                              +0.0356         +0.0356          1628          1686          1628          1686
BM_generic<TestOne, char>/32768/real_time                                             +0.0104         +0.0103          4738          4787          4738          4787
BM_generic<TestOne, char>/262144/real_time                                            -0.0010         -0.0009         29891         29861         29887         29859
BM_generic<TestOne, char>/2097152/real_time                                           -0.0012         -0.0013        229094        228822        229093        228799
BM_generic<TestOne, char>/16777216/real_time                                          -0.0005         -0.0005       1819794       1818908       1819537       1818632
BM_generic<TestOne, char>/134217728/real_time                                         +0.0000         -0.0001      14515161      14515325      14515103      14513096
BM_generic<TestOne, int>/1024/real_time                                               +0.0444         +0.0445          1200          1254          1200          1254
BM_generic<TestOne, int>/4096/real_time                                               +0.0532         +0.0532          1245          1311          1245          1311
BM_generic<TestOne, int>/32768/real_time                                              +0.0498         +0.0499          2118          2224          2118          2224
BM_generic<TestOne, int>/262144/real_time                                             -0.0182         -0.0182          8095          7948          8095          7948
BM_generic<TestOne, int>/2097152/real_time                                            -0.0254         -0.0256         67020         65314         67020         65306
BM_generic<TestOne, int>/16777216/real_time                                           +0.0084         +0.0080       1285524       1296280       1285315       1295542
BM_generic<TestOne, int>/134217728/real_time                                          +0.0003         -0.0002      38000738      38010376      38000847      37991355
BM_generic<TestOne, double>/1024/real_time                                            +0.0255         +0.0255          1216          1247          1216          1247
BM_generic<TestOne, double>/4096/real_time                                            +0.0248         +0.0248          1332          1365          1332          1365
BM_generic<TestOne, double>/32768/real_time                                           -0.0077         -0.0076          2837          2815          2837          2815
BM_generic<TestOne, double>/262144/real_time                                          +0.0038         +0.0037         15861         15921         15860         15920
BM_generic<TestOne, double>/2097152/real_time                                         +0.0054         +0.0054        221422        222617        221387        222580
BM_generic<TestOne, double>/16777216/real_time                                        +0.0123         +0.0120       2974411       3010927       2974202       3010001
BM_generic<TestOne, double>/134217728/real_time                                       -0.0004         -0.0007      76893525      76862987      76876481      76823400
BM_generic<TestOne, std::complex<double>>/1024/real_time                              +0.0100         +0.0100          1304          1317          1304          1317
BM_generic<TestOne, std::complex<double>>/4096/real_time                              -0.0038         -0.0037          1725          1718          1725          1718
BM_generic<TestOne, std::complex<double>>/32768/real_time                             -0.0215         -0.0216          5114          5004          5114          5004
BM_generic<TestOne, std::complex<double>>/262144/real_time                            +0.0053         +0.0053         32167         32336         32163         32334
BM_generic<TestOne, std::complex<double>>/2097152/real_time                           -0.0039         -0.0042        579797        577513        579786        577366
BM_generic<TestOne, std::complex<double>>/16777216/real_time                          -0.0115         -0.0120      15653291      15472524      15652900      15465697
BM_generic<TestOne, std::complex<double>>/134217728/real_time                         +0.0010         +0.0007     153962404     154117680     153957086     154064740
BM_generic<TestOneAsync, char>/1024/real_time                                         +0.0338         +0.0341          1274          1317          1274          1317
BM_generic<TestOneAsync, char>/4096/real_time                                         +0.0254         +0.0253          1650          1692          1650          1692
BM_generic<TestOneAsync, char>/32768/real_time                                        +0.0003         +0.0004          4760          4761          4759          4761
BM_generic<TestOneAsync, char>/262144/real_time                                       -0.0009         -0.0010         29908         29880         29908         29879
BM_generic<TestOneAsync, char>/2097152/real_time                                      -0.0019         -0.0018        229127        228702        229095        228686
BM_generic<TestOneAsync, char>/16777216/real_time                                     -0.0002         -0.0003       1819634       1819266       1819630       1819077
BM_generic<TestOneAsync, char>/134217728/real_time                                    -0.0001         -0.0000      14517688      14516472      14515609      14514885
BM_generic<TestOneAsync, int>/1024/real_time                                          +0.0402         +0.0396          1200          1249          1200          1248
BM_generic<TestOneAsync, int>/4096/real_time                                          +0.0336         +0.0335          1256          1298          1256          1298
BM_generic<TestOneAsync, int>/32768/real_time                                         +0.0028         +0.0028          2197          2203          2197          2203
BM_generic<TestOneAsync, int>/262144/real_time                                        -0.0228         -0.0229          8109          7924          8109          7923
BM_generic<TestOneAsync, int>/2097152/real_time                                       -0.0259         -0.0258         67018         65283         67004         65278
BM_generic<TestOneAsync, int>/16777216/real_time                                      -0.0033         -0.0035       1289588       1285376       1289563       1285094
BM_generic<TestOneAsync, int>/134217728/real_time                                     +0.0053         +0.0048      37844437      38045205      37836351      38018931
BM_generic<TestOneAsync, double>/1024/real_time                                       +0.0183         +0.0182          1213          1236          1213          1235
BM_generic<TestOneAsync, double>/4096/real_time                                       +0.0252         +0.0253          1326          1360          1326          1359
BM_generic<TestOneAsync, double>/32768/real_time                                      -0.0425         -0.0420          2937          2812          2935          2812
BM_generic<TestOneAsync, double>/262144/real_time                                     +0.0032         +0.0033         15834         15884         15831         15883
BM_generic<TestOneAsync, double>/2097152/real_time                                    +0.0051         +0.0049        221398        222524        221386        222464
BM_generic<TestOneAsync, double>/16777216/real_time                                   -0.3805         -0.3806       4613855       2858137       4613045       2857524
BM_generic<TestOneAsync, double>/134217728/real_time                                  +0.0002         -0.0003      76850781      76869107      76847478      76824442
BM_generic<TestOneAsync, std::complex<double>>/1024/real_time                         +0.0179         +0.0180          1280          1303          1280          1303
BM_generic<TestOneAsync, std::complex<double>>/4096/real_time                         +0.0030         +0.0029          1697          1702          1697          1702
BM_generic<TestOneAsync, std::complex<double>>/32768/real_time                        -0.0244         -0.0244          5084          4960          5083          4959
BM_generic<TestOneAsync, std::complex<double>>/262144/real_time                       +0.0060         +0.0060         32013         32206         32013         32204
BM_generic<TestOneAsync, std::complex<double>>/2097152/real_time                      +0.0011         +0.0008        583212        583846        583207        583646
BM_generic<TestOneAsync, std::complex<double>>/16777216/real_time                     -0.0018         -0.0028      15545784      15518490      15542691      15499532
BM_generic<TestOneAsync, std::complex<double>>/134217728/real_time                    +0.0003         -0.0001     154043217     154085789     154011225     153996562
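The first two numeric columns in the rows above are consistent with a relative change computed as (new - old) / old from the old and new timings that follow them. For example, the `TestOneAsync, double` row at 16777216 gives (2858137 - 4613855) / 4613855 ≈ -0.3805. A minimal sketch of that arithmetic, with `relative_change` as a hypothetical helper name that is not part of the benchmark tooling:

```cpp
#include <cmath>

// Hypothetical helper (not from the benchmark tooling): relative change
// between an old and a new timing. Negative means the new run was faster.
inline double relative_change(double old_value, double new_value) {
  return (new_value - old_value) / old_value;
}
```

Read this way, most rows differ by a few percent or less, and the one large change is the roughly 38% reduction for `TestOneAsync, double` at 16777216.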

@masterleinad
Contributor Author

masterleinad commented Apr 26, 2021

gcc complains if we use memcpy for types that are not trivially-copyable and trivial, so this is what I have now restricted the functionality to here as well. I'm happy to discuss if we want to lift that restriction from our side.
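As an illustration of the restriction described above, here is a minimal host-side sketch, not the code from this pull request. It assumes the fast path is only taken for types that are both trivial and trivially copyable and only when the fill value is bitwise zero. The namespace `sketch` and the names `fill_impl` and `fill_or_memset` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

namespace sketch {  // hypothetical namespace, not part of Kokkos

// Generic path: element-wise assignment, valid for any assignable type.
template <class T>
bool fill_impl(T* data, std::size_t n, const T& value, std::false_type) {
  for (std::size_t i = 0; i < n; ++i) data[i] = value;
  return false;  // memset was not used
}

// Fast path candidate: only instantiated for trivial, trivially copyable T,
// so the compiler has no reason to warn about raw memory access.
template <class T>
bool fill_impl(T* data, std::size_t n, const T& value, std::true_type) {
  const T zero{};  // value-initialized, i.e. all-zero bytes for trivial T
  if (std::memcmp(&value, &zero, sizeof(T)) == 0) {
    std::memset(data, 0, n * sizeof(T));
    return true;  // memset was used
  }
  return fill_impl(data, n, value, std::false_type{});
}

// Entry point: choose the path at compile time from the type traits.
template <class T>
bool fill_or_memset(T* data, std::size_t n, const T& value) {
  assert(data != nullptr || n == 0);
  using use_memset =
      std::integral_constant<bool, std::is_trivial<T>::value &&
                                       std::is_trivially_copyable<T>::value>;
  return fill_impl(data, n, value, use_memset{});
}

}  // namespace sketch
```

Under these assumptions a type such as `std::complex<double>`, which is not trivial because of its user-provided constructor, always takes the element-wise path, while `int` or `double` can use `memset` when the value is all-zero bytes.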

Member

@crtrott crtrott left a comment

I think Damien's comment still stands: plain_memcpy is probably not a good name.

core/src/Kokkos_CopyViews.hpp
@crtrott
Member

crtrott commented Apr 26, 2021

gcc complains if we use memcpy for types that are not trivially-copyable and trivial, so this is what I have now restricted the functionality to here as well. I'm happy to discuss if we want to lift that restriction from our side.

Yeah, that's fine.

@masterleinad
Contributor Author

I have now renamed plain_memcpy to contiguous_fill and memset to contiguous_fill_or_memset. Let me know if that's fine or if you have better suggestions.

@dalg24
Member

dalg24 commented Apr 28, 2021

Do you want to clean up the history, or should I squash and merge?

@masterleinad
Contributor Author

Just squash and merge. I don't think there is anything worth keeping in the history.

@dalg24 dalg24 merged commit 6b569fa into kokkos:develop Apr 28, 2021
@masterleinad masterleinad deleted the cuda_deep_copy_memset branch August 10, 2021 20:14