Parallel strided iteration does not scale linearly #5

Open
mratsim opened this issue Nov 4, 2018 · 0 comments

mratsim commented Nov 4, 2018

Running the following benchmark on my dual-core machine with hyper-threading enabled:

https://github.com/numforge/laser/blob/d725651c0f8ca1da3f761e27e9846bc81b9341f3/benchmarks/loop_iteration/iter_bench_prod.nim (branch foreach-dsl #4)

I get the following results without OpenMP:

Warmup: 1.1911 s, result 224 (displayed to avoid compiler optimizing warmup away)

Production implementation for tensor iteration - float64
Collected 1000 samples in 8.931 seconds
Average time: 8.930ms
Stddev  time: 0.345ms
Min     time: 8.723ms
Max     time: 14.785ms

Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413

Production implementation for tensor iteration - float64
Collected 1000 samples in 29.597 seconds
Average time: 29.597ms
Stddev  time: 1.125ms
Min     time: 27.002ms
Max     time: 38.606ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

and the following with OpenMP:

Warmup: 1.1874 s, result 224 (displayed to avoid compiler optimizing warmup away)

Production implementation for tensor iteration - float64
Collected 1000 samples in 4.094 seconds
Average time: 4.092ms
Stddev  time: 0.206ms
Min     time: 3.897ms
Max     time: 5.459ms

Display output[[0,0]] to make sure it's not optimized away
-0.41973403633413

Production implementation for tensor iteration - float64
Collected 1000 samples in 24.025 seconds
Average time: 24.022ms
Stddev  time: 1.127ms
Min     time: 22.379ms
Max     time: 33.763ms

Display output[[0,0]] to make sure it's not optimized away
1.143903810108473

Potential explanations:

  • hitting the maximum memory bandwidth:
    • Parallel strided iteration also means parallel cache misses.
    • One way to alleviate that would be to use GCC/Clang __builtin_prefetch (see the first sketch after this list).
    • Note: this does not match the experiments in https://github.com/zy97140/omp-benchmark-for-pytorch. Furthermore, computing the next element's location shouldn't require any memory access.
      Unfortunately, even though it needs no memory access, the CPU cannot use instruction-level parallelism to overlap the main computation with the computation of the next element's location, because of the branching involved.
  • accesses to shape and strides during strided iteration are slow:
    • This can happen if the compiler doesn't detect that no access mutates shape/strides, and therefore neither makes a local copy of them nor keeps them in registers (see the second sketch after this list).
  • strided iteration is running out of registers and/or instruction cache space:
    • Unlikely: as those are per-CPU core and not per-socket resources, we should still see a linear speedup.
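
For the prefetch idea, a minimal C sketch of what the inner loop could look like; the function name strided_copy, the element type and the prefetch distance of 8 elements are illustrative assumptions, not the actual Laser code:

```c
#include <stddef.h>

/* Sketch only: strided read with a software prefetch hint.
   The prefetch distance (8 elements ahead) would need tuning per CPU. */
void strided_copy(double *dst, const double *src, size_t n, ptrdiff_t stride) {
  for (size_t i = 0; i < n; ++i) {
    /* Ask the hardware to start fetching the element needed a few
       iterations from now: read-only (0), low temporal locality (1). */
    __builtin_prefetch(&src[(ptrdiff_t)(i + 8) * stride], 0, 1);
    dst[i] = src[(ptrdiff_t)i * stride];
  }
}
```

__builtin_prefetch is only a hint to the cache, so prefetching a few elements past the end of the buffer near the end of the loop does not fault.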
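
For the shape/strides access point, the usual workaround is to hoist them into locals so the compiler can keep them in registers instead of re-loading them on every iteration. A minimal C sketch, with a made-up Tensor2D layout that is not Laser's actual data structure:

```c
#include <stddef.h>

typedef struct {
  size_t    shape[2];
  ptrdiff_t strides[2];   /* strides in elements, not bytes */
  double   *data;
} Tensor2D;

void scale_strided(Tensor2D *t, double alpha) {
  /* Without these local copies the compiler may have to assume that writes
     through t->data could modify t->shape / t->strides, and re-load them
     from memory on every iteration. */
  const size_t    n0 = t->shape[0],   n1 = t->shape[1];
  const ptrdiff_t s0 = t->strides[0], s1 = t->strides[1];
  double *restrict data = t->data;

  for (size_t i = 0; i < n0; ++i)
    for (size_t j = 0; j < n1; ++j)
      data[(ptrdiff_t)i * s0 + (ptrdiff_t)j * s1] *= alpha;
}
```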