
NumPy tries to be too smart when dispatching to BLAS (gemm vs gemv), which prevents using MKL's strict CNR in full #13732

Open
aldanor opened this issue Jun 7, 2019 · 14 comments


@aldanor

aldanor commented Jun 7, 2019

Here's the actual use case: Intel recently released a "Strict CNR mode" for MKL (in the 2019.3 release); see https://software.intel.com/en-us/mkl-linux-developer-guide-reproducibility-conditions

Basically, for a small subset of BLAS functions, it guarantees strict bitwise reproducibility regardless of the number of threads used, etc.:

In strict CNR mode, Intel MKL provides bitwise reproducible results for a limited set of functions and code branches even when the number of threads changes. These routines and branches support strict CNR mode (64-bit libraries only):

  • ?gemm, ?symm, ?hemm, ?trsm and their CBLAS equivalents (cblas_?gemm, cblas_?symm, cblas_?hemm, and cblas_?trsm).
  • Intel® Advanced Vector Extensions 2 (Intel® AVX2) or Intel® Advanced Vector Extensions 512 (Intel® AVX-512).

For whatever reason, the *gemv functions are not listed, but the *gemm ones are - so all your matrix-by-matrix dot products can now be bitwise reproducible, but not your matrix-vector products...
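A minimal sketch of how strict CNR can be enabled from Python, per the Intel doc linked above (assuming an MKL-backed NumPy; the MKL_CBWR variable must be set before MKL is initialized, i.e. before importing numpy):

    import os

    # Request conditional numerical reproducibility in strict mode
    # (MKL 2019.3+). Must happen before MKL is loaded, i.e. before
    # importing numpy.
    os.environ['MKL_CBWR'] = 'AUTO,STRICT'

    import numpy as np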

Is there a way to force numpy to use e.g. dgemm/sgemm rather than dgemv/sgemv when multiplying (1 x n) by (n x m)?

I've stumbled upon this piece of code which, IIUC, tries to be "smart" and treats single-row/single-column matrices as vectors when dispatching to BLAS routines, so you can't force it to use the 'mm' versions of the BLAS functions instead of the 'mv' ones even if you want to:

    /* Classify an array as scalar, row, column, or full matrix so the
       BLAS call can be chosen accordingly. */
    static MatrixShape
    _select_matrix_shape(PyArrayObject *array)
    {
        switch (PyArray_NDIM(array)) {
            case 0:
                return _scalar;
            case 1:
                /* A length-1 1-D array is treated as a scalar. */
                if (PyArray_DIM(array, 0) > 1) {
                    return _column;
                }
                return _scalar;
            case 2:
                /* 2-D arrays with a size-1 dimension are demoted to
                   vectors (or to a scalar, if 1 x 1). */
                if (PyArray_DIM(array, 0) > 1) {
                    if (PyArray_DIM(array, 1) == 1) {
                        return _column;
                    }
                    else {
                        return _matrix;
                    }
                }
                if (PyArray_DIM(array, 1) == 1) {
                    return _scalar;
                }
                return _row;
        }
        return _matrix;
    }

@aldanor
Author

aldanor commented Jun 7, 2019

@oleksandr-pavlyk maybe you would know? (It's also a bit baffling that the mm functions support strict CNR while the mv ones don't, and not the other way around.)

@mhvk
Contributor

mhvk commented Jun 7, 2019

@aldanor - we do want code to be fast, so the special-casing makes sense. The work-around would be to ensure the arrays always have at least 2 dimensions (being careful to add the 1 in the right place).

@aldanor
Author

aldanor commented Jun 7, 2019

@mhvk That's totally understandable (I haven't tested, but I'm guessing gemv would be slightly faster than gemm in general on the same inputs). I was just wondering what the current workaround is to force it to use gemm (and maybe other folks can benefit from it as well, if anyone cares about strict CNR).

I actually have one idea: for matrix-vector dot products, instead of doing np.dot(X, y) one could do np.dot(X, y.reshape((1, -1, 1))).ravel(). If I understand correctly, for 3 dimensions and higher the special-casing is skipped, and for the second argument the second-to-last dimension is used.
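A minimal sketch of that workaround (untested here; it assumes the 3-D path really does bypass the special-casing shown above):

    import numpy as np

    X = np.random.rand(4, 3)
    y = np.random.rand(3)

    # Plain matrix-vector product; the dispatch above may route this to *gemv.
    out_mv = np.dot(X, y)

    # Make the second operand 3-D so _select_matrix_shape never demotes it,
    # then flatten the (4, 1, 1) result back into a vector.
    out_mm = np.dot(X, y.reshape((1, -1, 1))).ravel()

    assert np.allclose(out_mv, out_mm)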

I'll try to check it on Monday and report back (whether it would dispatch to gemm).

@mhvk
Contributor

mhvk commented Jun 7, 2019

@aldanor - yes, your work-around is what I would do too (not sure you need to make it 3-dimensional; isn't .reshape(-1, 1) good enough?)

Your issue does suggest that it would still be a good idea to have separate matmat, vecmat and matvec gufuncs - as originally suggested by @charris, I think.

@charris
Member

charris commented Jun 7, 2019

I looked at using matrix-matrix multiply instead of matrix-vector at some point and, IIRC, there was no difference in speed. No doubt that might depend on the library, but I think a good gemm implementation will special-case what looks like a vector if there is an advantage to doing so.

@aldanor
Author

aldanor commented Jun 7, 2019

@mhvk I think it has to be at least 3-dimensional - see the snippet I posted above: if the array is 2-D and one of the dimensions is of size 1, numpy starts being smart and treats it as a vector.

Yea, explicit matmat, vecmat and matvec would be awesome to have (see the sketch after this list):

  • It's more predictable which CBLAS routines get called, with a minimum of behind-the-scenes magic, so you can control things like MKL CNR.
  • If dot is called in a very hot loop where every microsecond counts, it won't have to do all this shape magic repeatedly and needlessly, making it (marginally) faster.
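For concreteness, a sketch of the intended semantics only (matvec/vecmat/matmat are illustrative names here, not existing NumPy API, and einsum would not dispatch to BLAS - this just pins down the shapes):

    import numpy as np

    def matvec(a, b):
        # (..., m, n) @ (..., n) -> (..., m)
        return np.einsum('...ij,...j->...i', a, b)

    def vecmat(a, b):
        # (..., n) @ (..., n, m) -> (..., m)
        return np.einsum('...i,...ij->...j', a, b)

    def matmat(a, b):
        # (..., m, n) @ (..., n, p) -> (..., m, p)
        return np.einsum('...ij,...jk->...ik', a, b)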

@seberg
Member

seberg commented Jun 7, 2019

Doesn't scipy provide thinner wrappers around much of LAPACK/BLAS nowadays? https://docs.scipy.org/doc/scipy/reference/linalg.blas.html might fit better if you really want to know which function gets called (even if it is likely less convenient).
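A minimal sketch of that route (scipy.linalg.blas.dgemm is the low-level wrapper; whether it reaches MKL depends on how scipy was built):

    import numpy as np
    from scipy.linalg import blas

    X = np.random.rand(4, 3)
    y = np.random.rand(3)

    # Call dgemm directly: no shape-based dispatch is involved, so the
    # (4 x 3) by (3 x 1) product goes through gemm even though one
    # operand is effectively a vector.
    out = blas.dgemm(alpha=1.0, a=X, b=y.reshape(-1, 1)).ravel()

    assert np.allclose(out, np.dot(X, y))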

@oleksandr-pavlyk
Contributor

I have asked around, and there are no immediate plans to make ?gemv support strict CNR, but I was told it should be possible to reuse ?gemm.

@aldanor
Author

aldanor commented Jun 7, 2019

@charris I can try running some benchmarks on mkl and openblas next week.

My thoughts exactly - I'd expect a mature gemm implementation to be smart enough to handle things like size-1 dimensions automatically, just as it would probably treat the alpha=1 case separately (which is what numpy calls it with). If there's no visible performance difference, numpy could just use gemm (this is still orthogonal to having explicit matvec etc. as public functions).

@mhvk
Contributor

mhvk commented Jun 7, 2019

@aldanor - that would be very useful. I agree that we should punt these kinds of decisions to the library, keeping our code simpler, but it would be good to be sure that we don't get hit with a big performance penalty.

P.S. Yes, you're right about the snippet - I mistook _DIM for _NDIM. Another argument for just getting rid of it!

@mhvk
Contributor

mhvk commented Jun 7, 2019

If you can, do add some benchmarks to numpy/benchmarks/benchmarks/bench_linalg.py - right now there is only a vector one as part of a more complex calculation.
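For example, a hypothetical addition in the asv style of that file (class name and problem sizes are illustrative only):

    import numpy as np

    class MatVecDot:
        def setup(self):
            self.X = np.random.rand(512, 512)
            self.y = np.random.rand(512)
            self.y_col = self.y.reshape(-1, 1)

        def time_dot_matvec(self):
            # 2-D times 1-D: currently dispatched as a matrix-vector product.
            np.dot(self.X, self.y)

        def time_dot_matcol(self):
            # 2-D times (n, 1): currently demoted to a vector as well.
            np.dot(self.X, self.y_col)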

@redditur

redditur commented Jun 24, 2019

The performance issue is a bit of a red herring in my opinion. We all want things to be performant; I'm not sure tying users' hands with magic is the sensible approach, though. As @aldanor points out, you actually lose performance in tight loops because of these checks.

+1 for explicit matmat, vecmat and matvec functions. At the very least it would make code more transparent.

@mhvk
Contributor

mhvk commented Jun 24, 2019

It does seem there is some consensus on a path forward:

  1. Add benchmark tests.
  2. Make a PR removing the special-casing of vector-matrix operations for the BLAS calls.
  3. Introduce explicit matmat, vecmat and matvec functions.

@aldanor
Author

aldanor commented Jun 25, 2019

Hi folks, I'm back with some benchmark results - I've yet to read through them myself, though (there are also strict CNR mode tests, which are not directly related to this thread but may be interesting nevertheless).

https://gist.github.com/aldanor/5bb9f1ff3577a4c6d35c267db75e47bd
