
OMP accelerated loops causing massive slow-down #1063

Closed
cpviolator opened this issue Oct 2, 2020 · 12 comments

@cpviolator
Member

cpviolator commented Oct 2, 2020

In the IRAM eigensolver there is some quarantined code in the qrIteration function that uses OMP. When activated, it causes a ~100x slowdown on the loop. The routine was prototyped in laptop code

https://github.com/cpviolator/VOATOL/blob/master/include/algoHelpers.h#L470

and gives good speed-up there. The QUDA code is here:

https://github.com/lattice/quda/blob/feature/arnoldi/lib/eig_iram.cpp#L272

Not sure what the problem is, as the #pragmas are pretty straightforward.

@kostrzewa
Member

Interesting. Note that the QUDA code is also buggy, as temp is not thread-local. As for performance, this has a huge number of implicit barriers (every parallel for ends with one).
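
For reference, a minimal sketch of both points, using a hypothetical helper rather than the actual QUDA routine: the temporary is declared inside the loop so that it is thread-local by construction, and the work sits in a single parallel region where each loop carries nowait, so only the closing barrier of the region is paid.

#include <complex>
#include <vector>

using Complex = std::complex<double>;

// Hypothetical helper: applies a 2x2 rotation to rows i and i+1 of R.
void apply_row_rotation(std::vector<std::vector<Complex>> &R, int i, int n_kr,
                        Complex T11, Complex T12, Complex T21, Complex T22)
{
#pragma omp parallel
  {
#pragma omp for schedule(static, 32) nowait
    for (int j = i + 1; j < n_kr; j++) {
      Complex tempp = R[i][j]; // thread-local copy of the old value
      R[i][j]     -= (T11 * tempp + T12 * R[i + 1][j]);
      R[i + 1][j] -= (T21 * tempp + T22 * R[i + 1][j]);
    }
    // further independent loops could follow here, each with its own
    // "omp for ... nowait", before the single barrier at the end of the region
  }
}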

@kostrzewa
Member

kostrzewa commented Oct 2, 2020

What's the usual size of n_kr here? If it is too small, the static work assignment with a chunk size of 32 will leave the loop running essentially single-threaded (or with only a few threads): with a trip count of 64, for example, there are only two chunks, so at most two threads get any work.

@kostrzewa
Member

Also, as i increases, the amount of work in the interior loop over j will inevitably become small, compounding the issue in the above comment.

@cpviolator
Member Author

cpviolator commented Oct 2, 2020

Thanks for pointing out the thread-local bug. I had it coded properly, then deleted it all, but wanted to add the quarantined code back in, so I copied it sloppily :(

I did have the following threshold check in there at one point:

      // Continue for the other columns
      if ((n_kr - (i+1)) > 64) {
#pragma omp parallel for schedule(static,32)
        for(int j=i+1; j < n_kr; j++) {
          Complex tempp = R[i][j];
          R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
          R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
        }
      } else {
        for(int j=i+1; j < n_kr; j++) {
          Complex temp = R[i][j];
          R[i][j] -= (T11 * temp + T12 * R[i+1][j]);
          R[i+1][j] -= (T21 * temp + T22 * R[i+1][j]);
        }
      }

(and similar checks for the other loops) as a heuristic way of only going multi-threaded when it might be worth it. It seemed to work well on my laptop, but here in QUDA, any OMP pragmas destroy the loop speed.
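
For what it's worth, the same threshold can be expressed with OpenMP's if clause instead of duplicating the loop body; a minimal sketch, assuming the same variables as in the snippet above:

      // The if clause decides at runtime whether to fork a team, so small
      // trip counts run serially without a second copy of the loop body.
#pragma omp parallel for schedule(static,32) if(n_kr - (i+1) > 64)
      for(int j=i+1; j < n_kr; j++) {
        Complex tempp = R[i][j];
        R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
        R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
      }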

@cpviolator
Member Author

BTW, for a decent calculation size, n_kr would be around 1024.

@cpviolator
Member Author

I did wonder if it was data ordering. On my laptop, I use Eigen objects which are in row-major order, and the matrices Q and R are updated row by row. However, when I refactored the above to use row-major objects, the same slowdown was seen.
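
Side note on storage order, since Eigen defaults to column-major: the row-major layout has to be requested explicitly via the Options template parameter. A minimal sketch, not taken from either code base:

#include <complex>
#include <Eigen/Dense>

// Eigen stores matrices column-major unless told otherwise; a row-by-row
// update pattern favours the explicit RowMajor variant.
using Complex = std::complex<double>;
using MatColMajor = Eigen::Matrix<Complex, Eigen::Dynamic, Eigen::Dynamic>;                  // default: column-major
using MatRowMajor = Eigen::Matrix<Complex, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>; // explicit row-major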

@kostrzewa
Member

A possibility might be that your Eigen-based code is simply very slow, so it scales well with the number of threads.

@kostrzewa
Member

Which architecture did you observe the slowdown in QUDA on, by the way? If this was on Power9, then I wouldn't be too surprised. Look at the behaviour of the stream triad on a dual-Power9 with 2x16 cores:

[plot: STREAM triad behaviour on a dual-socket Power9 (2x16 cores)]

@kostrzewa
Member

Also on EPYC, threading overheads for small working sets clearly dominate multi-threaded bandwidth:

[plot: multi-threaded bandwidth on EPYC for small working sets]

@cpviolator
Member Author

@kostrzewa These plots are gorgeous!

I tested on two architectures: Haswell and Power9. I actually resolved the issue on Haswell: there was a proliferation of background jobs running during the OMP tests. Once they had all calmed down (and I used the all-important export MV2_ENABLE_AFFINITY=0), I saw a healthy 2x speed-up with 6 OMP cores vs. LAPACKE-accelerated Eigen. A lot of that speed-up comes from the fact that we can control the tolerance on the QR iteration. We find that for a tolerance of 1e-n on the eigenpair residua, the QR iteration is stable for a tolerance of 1e-(n+1) and smaller. We can set the maximum number of iterations used for Eigen's Schur decomposition, but this does not have a reliable mapping to tolerance, so it introduces instability. The final version of the OMP code I settled on is:

      // Continue for the other columns
#ifdef _OPENMP
#pragma omp parallel for schedule(static,32)
#endif
      for(int j=i+1; j < n_kr; j++) {
        Complex tempp = R[i][j];
        R[i][j] -= (T11 * tempp + T12 * R[i+1][j]);
        R[i+1][j] -= (T21 * tempp + T22 * R[i+1][j]);
      }
    }

    // Rotate R and V, i.e. H->RQ, V->VQ
    // Loop over columns of upper Hessenberg
    for(int j = 0; j < n_kr - 1; j++) {
      if(abs(R11[j]) > tol) {
        // Loop over the rows, up to the sub-diagonal element i=j+1
#ifdef _OPENMP
#pragma omp parallel
        {
#pragma omp for schedule(static,32) nowait
#endif
          for(int i = 0; i < j+2; i++) {
            Complex tempp = R[i][j];
            R[i][j] -= (R11[j] * tempp + R12[j] * R[i][j+1]);
            R[i][j+1] -= (R21[j] * tempp + R22[j] * R[i][j+1]);
          }
#ifdef _OPENMP
#pragma omp for schedule(static,32) nowait
#endif
          for(int i = 0; i < n_kr; i++) {
            Complex tempp = Q[i][j];
            Q[i][j] -= (R11[j] * tempp + R12[j] * Q[i][j+1]);
            Q[i][j+1] -= (R21[j] * tempp + R22[j] * Q[i][j+1]);
          }
#ifdef _OPENMP
        } // close the parallel region (only opened when _OPENMP is defined)
#endif
      }
    }

I've yet to observe this speed-up on Power9. I'm wondering if there is some equivalent of MV2_ENABLE_AFFINITY on Summit that I need to set.
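
One quick way to see what the runtime is doing with the threads is to print the binding policy and the hardware CPU each OpenMP thread lands on; a minimal diagnostic sketch using standard OpenMP 4.0 calls plus glibc's sched_getcpu(), nothing Summit-specific:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE // sched_getcpu() is a glibc extension
#endif
#include <sched.h>
#include <cstdio>
#include <omp.h>

// Prints the proc-bind policy and the CPU each thread runs on, which makes
// collapsed affinity (all threads pinned to one core) easy to spot.
int main()
{
  std::printf("proc_bind policy: %d, num_procs: %d\n",
              (int)omp_get_proc_bind(), omp_get_num_procs());
#pragma omp parallel
  {
#pragma omp critical
    std::printf("thread %d of %d on cpu %d\n",
                omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }
  return 0;
}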

@cpviolator
Member Author

cpviolator commented Oct 6, 2020

So, on Summit's Power9, the command to use is:

jsrun --nrs 1 -a1 -g1 -c${CPU} -EOMP_NUM_THREADS=${CPU} -brs -l gpu-gpu

I tested with a for CPU in 1 2 4 8 16 32 loop, using the flag --eig-use-eigen-qr false to force the OMP code path and schedule(static,32) for all loops, with the following results (only the timings relevant to the dense algebra are shown):
N_CPU=1: ~90 secs

eigenEV     = 12.020191 secs (  5.15%),	 with        6 calls at 2.003365e+06 us per call
host compute     = 78.627692 secs (  33.7%),	 with       10 calls at 7.862769e+06 us per call

N_CPU=2: ~56 secs

eigenEV     = 7.007859 secs (  3.56%),	 with        6 calls at 1.167976e+06 us per call
host compute     = 49.272701 secs (  25.1%),	 with       10 calls at 4.927270e+06 us per call

N_CPU=4: ~44 secs

eigenEV     = 4.429335 secs (  2.45%),	 with        6 calls at 7.382225e+05 us per call
host compute     = 36.023938 secs (  19.9%),	 with       10 calls at 3.602394e+06 us per call

N_CPU=8: ~ 42 secs

eigenEV     = 3.145898 secs (  1.73%),	 with        6 calls at 5.243163e+05 us per call
host compute     = 38.750711 secs (  21.3%),	 with       10 calls at 3.875071e+06 us per call

N_CPU=16: 35 secs

eigenEV     = 2.503287 secs (  1.43%),	 with        6 calls at 4.172145e+05 us per call
host compute     = 32.357466 secs (  18.5%),	 with       10 calls at 3.235747e+06 us per call

N_CPU=32: ~75 secs

eigenEV     = 2.193851 secs (  1.02%),	 with        6 calls at 3.656418e+05 us per call
host compute     = 72.706578 secs (  33.8%),	 with       10 calls at 7.270658e+06 us per call

So using N_CPU=16 gives the best speed-up. In contrast, the baseline Eigen implementation with LAPACKE acceleration gives the following with 1 CPU and 16 CPUs:
EIGEN, N_CPU=1: ~81 secs

eigenEV     = 10.896648 secs (  4.91%),	 with        6 calls at 1.816108e+06 us per call
eigenQR     = 42.431922 secs (  19.1%),	 with        6 calls at 7.071987e+06 us per call
host compute     = 28.158312 secs (  12.7%),	 with        4 calls at 7.039578e+06 us per call

EIGEN, N_CPU=16: ~66 secs

eigenEV     = 2.430413 secs (  1.22%),	 with        6 calls at 4.050688e+05 us per call
eigenQR     = 42.681786 secs (  21.4%),	 with        6 calls at 7.113631e+06 us per call
host compute     = 11.136929 secs (  5.59%),	 with        4 calls at 2.784232e+06 us per call

So it looks like the best speed-up comes from the host OMP-accelerated code with 16 CPUs and a QR tolerance set one order of magnitude smaller than the eigensolver tolerance.

@kostrzewa
Member

@kostrzewa These plots are gorgeous!

Cheers, these are essentially one-liners in ggplot2 (https://ggplot2.tidyverse.org/).

I'm happy that you've found scaling after all!
