OpenMP: No memset in viewfill #6573

Merged
merged 4 commits into from
Nov 9, 2023

rgayatri23
Contributor

@rgayatri23 rgayatri23 commented Nov 2, 2023

This PR attempts to fix issue #6480, where filling a view with 0's is very slow compared to filling the same view with 1's.
The slowdown is reproducible with both the clang and g++ compilers, and with the OpenMP and Serial backends.

Example to test the solution with OpenMP

#include <cstdint>  // int64_t, uint32_t
#include <cstdio>   // printf
#include <Kokkos_Core.hpp>

using Device = Kokkos::DefaultExecutionSpace;
using exec_space = typename Device::execution_space;
using type = int64_t;
using policy = Kokkos::RangePolicy<exec_space>;
using view = Kokkos::View<type*, Device>;

void fill_test(size_t n, type x) {
    view a(Kokkos::view_alloc(Kokkos::WithoutInitializing, "test"), n);
    uint32_t t_iter = 5;
    for (uint32_t i = 0; i < t_iter; i++) {
        Kokkos::fence();
        Kokkos::Timer t;
        Kokkos::deep_copy(exec_space(), a, x);
        Kokkos::fence();
        double d = t.seconds();
        double bandwidth = static_cast<double>(sizeof(type) * n) / d;
        printf("Write time: %.2fs; Write Bandwidth: %.2fGB/s\n", d,
               bandwidth / 1000000000);
    }
}

int main() {
    Kokkos::initialize();
    {
        for (size_t n = 100000; n < 100000000; n *= 10) {
            printf("n = %zu, Fill test -1\n",n);
            fill_test(n, -1);

            printf("n = %zu, Fill test 0\n",n);
            fill_test(n, 0);
        }
    }
    Kokkos::finalize();
    return 0;
}
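As an aid for reading the numbers that follow: the test reports bandwidth as bytes written divided by elapsed seconds, scaled to GB/s with 1 GB = 1e9 bytes. Here is a minimal sketch of that arithmetic. The helper names `bandwidth_gb_per_s` and `seconds_for` are hypothetical and are not part of the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: same formula as the test program,
// sizeof(type) * n / seconds, reported in GB/s (1 GB = 1e9 bytes).
inline double bandwidth_gb_per_s(std::size_t n, double seconds) {
    return static_cast<double>(sizeof(std::int64_t) * n) / seconds / 1.0e9;
}

// Hypothetical inverse: seconds needed to write n int64_t elements
// at a given bandwidth in GB/s.
inline double seconds_for(std::size_t n, double gb_per_s) {
    return static_cast<double>(sizeof(std::int64_t) * n) / (gb_per_s * 1.0e9);
}
```

For example, n = 10000000 elements is 80,000,000 bytes, so `seconds_for(10000000, 770.0)` is roughly 1.04e-4 s. This is why the `%.2fs` field in the output prints `0.00s` for most runs, and why the bandwidth column is the one to compare.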

Results with OMP_NUM_THREADS=8, built with gcc/11.2, on an AMD EPYC.
Similar behavior is observed with higher thread counts too.

current develop

n = 100000, Fill test -1
Write time: 0.00s; Write Bandwidth: 4.61GB/s
Write time: 0.00s; Write Bandwidth: 45.42GB/s
Write time: 0.00s; Write Bandwidth: 51.78GB/s
Write time: 0.00s; Write Bandwidth: 54.10GB/s
Write time: 0.00s; Write Bandwidth: 55.33GB/s
n = 100000, Fill test 0
Write time: 0.00s; Write Bandwidth: 2.21GB/s
Write time: 0.00s; Write Bandwidth: 38.52GB/s
Write time: 0.00s; Write Bandwidth: 44.66GB/s
Write time: 0.00s; Write Bandwidth: 38.48GB/s
Write time: 0.00s; Write Bandwidth: 43.51GB/s
n = 1000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 18.05GB/s
Write time: 0.00s; Write Bandwidth: 468.30GB/s
Write time: 0.00s; Write Bandwidth: 426.53GB/s
Write time: 0.00s; Write Bandwidth: 488.37GB/s
Write time: 0.00s; Write Bandwidth: 490.14GB/s
n = 1000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 6.71GB/s
Write time: 0.00s; Write Bandwidth: 50.00GB/s
Write time: 0.00s; Write Bandwidth: 52.21GB/s
Write time: 0.00s; Write Bandwidth: 52.14GB/s
Write time: 0.00s; Write Bandwidth: 52.29GB/s
n = 10000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 83.36GB/s
Write time: 0.00s; Write Bandwidth: 749.45GB/s
Write time: 0.00s; Write Bandwidth: 650.32GB/s
Write time: 0.00s; Write Bandwidth: 709.24GB/s
Write time: 0.00s; Write Bandwidth: 749.23GB/s
n = 10000000, Fill test 0
Write time: 0.01s; Write Bandwidth: 10.28GB/s
Write time: 0.00s; Write Bandwidth: 44.09GB/s
Write time: 0.00s; Write Bandwidth: 49.85GB/s
Write time: 0.00s; Write Bandwidth: 49.66GB/s
Write time: 0.00s; Write Bandwidth: 49.94GB/s

This PR

n = 100000, Fill test -1
Write time: 0.00s; Write Bandwidth: 4.58GB/s
Write time: 0.00s; Write Bandwidth: 54.69GB/s
Write time: 0.00s; Write Bandwidth: 56.95GB/s
Write time: 0.00s; Write Bandwidth: 67.04GB/s
Write time: 0.00s; Write Bandwidth: 68.07GB/s
n = 100000, Fill test 0
Write time: 0.00s; Write Bandwidth: 5.39GB/s
Write time: 0.00s; Write Bandwidth: 55.49GB/s
Write time: 0.00s; Write Bandwidth: 66.43GB/s
Write time: 0.00s; Write Bandwidth: 67.72GB/s
Write time: 0.00s; Write Bandwidth: 76.26GB/s
n = 1000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 20.49GB/s
Write time: 0.00s; Write Bandwidth: 432.99GB/s
Write time: 0.00s; Write Bandwidth: 457.82GB/s
Write time: 0.00s; Write Bandwidth: 462.61GB/s
Write time: 0.00s; Write Bandwidth: 472.76GB/s
n = 1000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 21.60GB/s
Write time: 0.00s; Write Bandwidth: 430.20GB/s
Write time: 0.00s; Write Bandwidth: 462.05GB/s
Write time: 0.00s; Write Bandwidth: 467.48GB/s
Write time: 0.00s; Write Bandwidth: 483.03GB/s
n = 10000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 133.27GB/s
Write time: 0.00s; Write Bandwidth: 761.82GB/s
Write time: 0.00s; Write Bandwidth: 766.50GB/s
Write time: 0.00s; Write Bandwidth: 771.60GB/s
Write time: 0.00s; Write Bandwidth: 764.00GB/s
n = 10000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 126.01GB/s
Write time: 0.00s; Write Bandwidth: 761.09GB/s
Write time: 0.00s; Write Bandwidth: 770.19GB/s
Write time: 0.00s; Write Bandwidth: 770.12GB/s
Write time: 0.00s; Write Bandwidth: 769.30GB/s

In a follow-up PR, I will add this test as a benchmark.

@rgayatri23 rgayatri23 added the Performance (code showing unusually slow performance for an architecture and/or backend) and Kokkos-Core labels Nov 2, 2023
@rgayatri23 rgayatri23 self-assigned this Nov 2, 2023
@masterleinad
Contributor

I'm at least surprised that a manual loop would be faster than memset for the Serial backend.

@rgayatri23
Contributor Author

I'm at least surprised that a manual loop would be faster than memset for the Serial backend.

Sorry, I was wrong on that one. It's the opposite: memset is 2x faster for the Serial backend. I corrected it.
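For context, the two strategies being compared can be sketched in plain C++ as follows. This is an illustration under assumptions, not the Kokkos implementation: `zero_with_memset` and `fill_with_loop` are hypothetical names, and the OpenMP pragma only takes effect when the code is compiled with OpenMP enabled (for example `-fopenmp` with g++). Without it, the loop simply runs serially.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Strategy 1 (hypothetical name): one memset call issued by the calling thread.
// Only usable when the fill value is all-zero bytes.
inline void zero_with_memset(std::vector<std::int64_t>& a) {
    if (!a.empty()) {
        std::memset(a.data(), 0, a.size() * sizeof(std::int64_t));
    }
}

// Strategy 2 (hypothetical name): element-wise loop, valid for any fill value.
// With OpenMP enabled, iterations are shared among the threads.
inline void fill_with_loop(std::vector<std::int64_t>& a, std::int64_t x) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
    std::int64_t* p = a.data();
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        p[i] = x;
    }
}
```

Both functions leave the buffer in the same state when `x` is 0. They differ only in whether the work is done by one thread through `memset` or spread across threads by the loop.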

OpenMP: do not use memset for 0's only if execution space is OpenMP

Co-authored-by: Daniel Arndt <arndtd@ornl.gov>
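The idea in this commit message, avoiding the memset path only when the execution space is OpenMP, can be sketched with overload-based dispatch. The tag types `SerialTag` and `OpenMPTag`, the enum `FillPath`, and the function `zero_fill` are hypothetical stand-ins for illustration. They are not Kokkos types, and the actual change in the PR may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for execution space types.
struct SerialTag {};
struct OpenMPTag {};

// Reports which path was taken, so the dispatch is observable.
enum class FillPath { Memset, Loop };

// Serial: keep the single memset call.
inline FillPath zero_fill(SerialTag, std::int64_t* p, std::size_t n) {
    if (n != 0) {
        std::memset(p, 0, n * sizeof(std::int64_t));
    }
    return FillPath::Memset;
}

// OpenMP: use a loop instead (threaded only when compiled with OpenMP enabled).
inline FillPath zero_fill(OpenMPTag, std::int64_t* p, std::size_t n) {
    const std::ptrdiff_t len = static_cast<std::ptrdiff_t>(n);
#pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < len; ++i) {
        p[i] = 0;
    }
    return FillPath::Loop;
}
```

A call such as `zero_fill(OpenMPTag{}, ptr, n)` selects the loop at compile time, so the Serial path pays no runtime cost for the choice.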
@fnrizzi
Contributor

fnrizzi commented Nov 9, 2023

maybe it makes sense to add this to the benchmarks?

@masterleinad
Contributor

We should probably do the same for the View construction, see Kokkos_ViewMapping.hpp (but that can be in a follow-up).

@rgayatri23
Contributor Author

Posting the slowdown observed when filling views with 0's using the OpenMP backend, for different thread counts, on the current develop branch.

OMP_NUM_THREADS=2

n = 100000, Fill test -1
Write time: 0.00s; Write Bandwidth: 2.26GB/s
Write time: 0.00s; Write Bandwidth: 34.84GB/s
Write time: 0.00s; Write Bandwidth: 37.99GB/s
Write time: 0.00s; Write Bandwidth: 36.58GB/s
Write time: 0.00s; Write Bandwidth: 37.45GB/s
n = 100000, Fill test 0
Write time: 0.00s; Write Bandwidth: 1.86GB/s
Write time: 0.00s; Write Bandwidth: 55.52GB/s
Write time: 0.00s; Write Bandwidth: 58.28GB/s
Write time: 0.00s; Write Bandwidth: 60.72GB/s
Write time: 0.00s; Write Bandwidth: 60.91GB/s
n = 1000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 8.17GB/s
Write time: 0.00s; Write Bandwidth: 92.28GB/s
Write time: 0.00s; Write Bandwidth: 104.30GB/s
Write time: 0.00s; Write Bandwidth: 96.42GB/s
Write time: 0.00s; Write Bandwidth: 98.13GB/s
n = 1000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 9.23GB/s
Write time: 0.00s; Write Bandwidth: 58.10GB/s
Write time: 0.00s; Write Bandwidth: 58.27GB/s
Write time: 0.00s; Write Bandwidth: 58.28GB/s
Write time: 0.00s; Write Bandwidth: 58.28GB/s
n = 10000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 20.64GB/s
Write time: 0.00s; Write Bandwidth: 50.90GB/s
Write time: 0.00s; Write Bandwidth: 59.54GB/s
Write time: 0.00s; Write Bandwidth: 59.32GB/s
Write time: 0.00s; Write Bandwidth: 59.47GB/s
n = 10000000, Fill test 0
Write time: 0.01s; Write Bandwidth: 12.12GB/s
Write time: 0.00s; Write Bandwidth: 44.90GB/s
Write time: 0.00s; Write Bandwidth: 49.49GB/s
Write time: 0.00s; Write Bandwidth: 49.87GB/s
Write time: 0.00s; Write Bandwidth: 44.36GB/s

OMP_NUM_THREADS=4

n = 100000, Fill test -1
Write time: 0.00s; Write Bandwidth: 3.60GB/s
Write time: 0.00s; Write Bandwidth: 64.18GB/s
Write time: 0.00s; Write Bandwidth: 64.08GB/s
Write time: 0.00s; Write Bandwidth: 62.62GB/s
Write time: 0.00s; Write Bandwidth: 65.28GB/s
n = 100000, Fill test 0
Write time: 0.00s; Write Bandwidth: 2.54GB/s
Write time: 0.00s; Write Bandwidth: 53.20GB/s
Write time: 0.00s; Write Bandwidth: 56.67GB/s
Write time: 0.00s; Write Bandwidth: 58.28GB/s
Write time: 0.00s; Write Bandwidth: 60.40GB/s
n = 1000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 12.79GB/s
Write time: 0.00s; Write Bandwidth: 171.67GB/s
Write time: 0.00s; Write Bandwidth: 168.80GB/s
Write time: 0.00s; Write Bandwidth: 171.38GB/s
Write time: 0.00s; Write Bandwidth: 172.56GB/s
n = 1000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 7.96GB/s
Write time: 0.00s; Write Bandwidth: 58.38GB/s
Write time: 0.00s; Write Bandwidth: 58.63GB/s
Write time: 0.00s; Write Bandwidth: 58.49GB/s
Write time: 0.00s; Write Bandwidth: 58.41GB/s
n = 10000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 49.40GB/s
Write time: 0.00s; Write Bandwidth: 183.65GB/s
Write time: 0.00s; Write Bandwidth: 203.21GB/s
Write time: 0.00s; Write Bandwidth: 216.32GB/s
Write time: 0.00s; Write Bandwidth: 216.86GB/s
n = 10000000, Fill test 0
Write time: 0.01s; Write Bandwidth: 13.58GB/s
Write time: 0.00s; Write Bandwidth: 45.25GB/s
Write time: 0.00s; Write Bandwidth: 43.72GB/s
Write time: 0.00s; Write Bandwidth: 47.85GB/s
Write time: 0.00s; Write Bandwidth: 50.10GB/s

OMP_NUM_THREADS=8

n = 100000, Fill test -1
Write time: 0.00s; Write Bandwidth: 4.51GB/s
Write time: 0.00s; Write Bandwidth: 81.81GB/s
Write time: 0.00s; Write Bandwidth: 80.01GB/s
Write time: 0.00s; Write Bandwidth: 87.36GB/s
Write time: 0.00s; Write Bandwidth: 93.83GB/s
n = 100000, Fill test 0
Write time: 0.00s; Write Bandwidth: 2.49GB/s
Write time: 0.00s; Write Bandwidth: 51.91GB/s
Write time: 0.00s; Write Bandwidth: 55.18GB/s
Write time: 0.00s; Write Bandwidth: 58.93GB/s
Write time: 0.00s; Write Bandwidth: 60.77GB/s
n = 1000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 12.32GB/s
Write time: 0.00s; Write Bandwidth: 265.18GB/s
Write time: 0.00s; Write Bandwidth: 273.25GB/s
Write time: 0.00s; Write Bandwidth: 236.29GB/s
Write time: 0.00s; Write Bandwidth: 270.30GB/s
n = 1000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 7.50GB/s
Write time: 0.00s; Write Bandwidth: 52.09GB/s
Write time: 0.00s; Write Bandwidth: 52.31GB/s
Write time: 0.00s; Write Bandwidth: 52.35GB/s
Write time: 0.00s; Write Bandwidth: 52.29GB/s
n = 10000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 60.60GB/s
Write time: 0.00s; Write Bandwidth: 314.60GB/s
Write time: 0.00s; Write Bandwidth: 314.42GB/s
Write time: 0.00s; Write Bandwidth: 315.79GB/s
Write time: 0.00s; Write Bandwidth: 337.31GB/s
n = 10000000, Fill test 0
Write time: 0.01s; Write Bandwidth: 11.21GB/s
Write time: 0.00s; Write Bandwidth: 38.54GB/s
Write time: 0.00s; Write Bandwidth: 44.82GB/s
Write time: 0.00s; Write Bandwidth: 39.06GB/s
Write time: 0.00s; Write Bandwidth: 45.75GB/s

The difference becomes more prominent as the number of threads increases.

@dalg24
Member

dalg24 commented Nov 9, 2023

Adding a reference to the PR that introduced ZeroMemset: #3944 (comment).
At the time, the discussion focused on whether to use a single memset or to split the array and do one memset per thread. There is mention of a comparison against an omp parallel for.
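The "one memset per thread" option mentioned here can be sketched as follows. This is a hypothetical illustration, not the ZeroMemset implementation: the function name `chunked_memset_zero` and its `nchunks` parameter are assumptions, and the chunks are only processed in parallel when the code is compiled with OpenMP enabled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical sketch: split `bytes` into `nchunks` contiguous chunks and
// zero each chunk with its own memset call.
inline void chunked_memset_zero(void* ptr, std::size_t bytes, int nchunks) {
    if (ptr == nullptr || bytes == 0 || nchunks <= 0) return;
    char* base = static_cast<char*>(ptr);
    const std::size_t parts = static_cast<std::size_t>(nchunks);
    // Ceiling division so that the chunks cover every byte.
    const std::size_t chunk = (bytes + parts - 1) / parts;
#pragma omp parallel for
    for (int c = 0; c < nchunks; ++c) {
        // Clamp both ends so trailing chunks may be short or empty.
        const std::size_t begin = std::min(bytes, static_cast<std::size_t>(c) * chunk);
        const std::size_t end = std::min(bytes, begin + chunk);
        if (end > begin) {
            std::memset(base + begin, 0, end - begin);
        }
    }
}
```

With `nchunks` equal to 1 this reduces to the single-memset variant, which makes the two options easy to time against each other and against a plain parallel loop.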

@masterleinad
Contributor

masterleinad commented Nov 9, 2023

Seeing

n = 100000, Fill test -1
Write time: 0.02s; Write Bandwidth: 0.05GB/s
Write time: 0.00s; Write Bandwidth: 100.29GB/s
Write time: 0.00s; Write Bandwidth: 120.08GB/s
Write time: 0.00s; Write Bandwidth: 112.74GB/s
Write time: 0.00s; Write Bandwidth: 130.25GB/s
n = 100000, Fill test 0
Write time: 0.00s; Write Bandwidth: 2.09GB/s
Write time: 0.00s; Write Bandwidth: 36.28GB/s
Write time: 0.00s; Write Bandwidth: 48.19GB/s
Write time: 0.00s; Write Bandwidth: 49.36GB/s
Write time: 0.00s; Write Bandwidth: 49.61GB/s
n = 1000000, Fill test -1
Write time: 0.00s; Write Bandwidth: 2.73GB/s
Write time: 0.01s; Write Bandwidth: 1.00GB/s
Write time: 0.00s; Write Bandwidth: 2.02GB/s
Write time: 0.00s; Write Bandwidth: 777.30GB/s
Write time: 0.00s; Write Bandwidth: 711.74GB/s
n = 1000000, Fill test 0
Write time: 0.00s; Write Bandwidth: 3.83GB/s
Write time: 0.00s; Write Bandwidth: 16.67GB/s
Write time: 0.00s; Write Bandwidth: 25.60GB/s
Write time: 0.00s; Write Bandwidth: 25.60GB/s
Write time: 0.00s; Write Bandwidth: 25.02GB/s
n = 10000000, Fill test -1
Write time: 0.04s; Write Bandwidth: 2.21GB/s
Write time: 0.01s; Write Bandwidth: 6.77GB/s
Write time: 0.00s; Write Bandwidth: 85.59GB/s
Write time: 0.00s; Write Bandwidth: 169.72GB/s
Write time: 0.00s; Write Bandwidth: 351.79GB/s
n = 10000000, Fill test 0
Write time: 0.01s; Write Bandwidth: 7.08GB/s
Write time: 0.01s; Write Bandwidth: 14.39GB/s
Write time: 0.01s; Write Bandwidth: 14.39GB/s
Write time: 0.01s; Write Bandwidth: 14.24GB/s
Write time: 0.01s; Write Bandwidth: 14.39GB/s

on SapphireRapids (104 cores) using develop. I can confirm that it's better not to use memset when we can use more than like 2 threads.

@dalg24 dalg24 merged commit 3f773d0 into kokkos:develop Nov 9, 2023
27 of 28 checks passed
@dalg24 dalg24 mentioned this pull request Nov 9, 2023
Labels
Kokkos-Core Performance Code showing unusually slow performance for an architecture and/or backend