
ENH: Use syrk to compute certain dot products more quickly and accurately #6932

Merged: 4 commits into numpy:master from the opt_dot_trans branch on Jan 6, 2016

Conversation

@jakirkham (Contributor)

Fixes #6794

Related: #6930

Handles the special cases of a.T @ a and a @ a.T by using syrk instead of gemm.
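
For readers skimming the thread: syrk is the BLAS symmetric rank-k update, and it writes only one triangle of the result, which is where both the speed and the guaranteed symmetry come from. A minimal sketch of what it computes, using SciPy's BLAS wrappers purely for illustration (the PR itself calls BLAS from C):

```python
import numpy as np
from scipy.linalg import blas  # SciPy used here for illustration only

a = np.random.rand(3, 4)

# dsyrk fills only one triangle of the symmetric product a @ a.T
c = blas.dsyrk(1.0, a)               # upper triangle of a @ a.T
full = np.triu(c) + np.triu(c, 1).T  # mirror to recover the full matrix

assert np.allclose(full, np.dot(a, a.T))
```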

@jakirkham (Contributor, Author)

I think something might be wrong with the ld* parameter I am supplying here. If somebody who knows a bit more about how this works in the gemm case can explain it to me, maybe I can get it fixed up.
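
For context on the ld* parameters (an aside, not from the thread): in Fortran BLAS, the leading dimension is the element stride between consecutive columns of the stored matrix, which for a NumPy array can be read off its strides. A rough sketch:

```python
import numpy as np

a = np.ones((5, 7))  # C-contiguous float64

# Seen through column-major BLAS, a C-contiguous (m, n) array is an
# (n, m) Fortran matrix whose leading dimension is the row stride
# measured in elements:
lda = a.strides[0] // a.itemsize
assert lda == 7
```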

Also, I have tried to detect whether an array is a transpose view using an ad hoc mechanism I came up with. If somebody has a better way to check this, I am open to suggestions.

@jakirkham (Contributor, Author)

I tried building this with OpenBLAS locally and am not seeing the same error. Also, I am seeing an appreciable speed-up on transposed arrays (~2x faster).

@jakirkham (Contributor, Author)

OK, it no longer seems to be complaining about the ld* values, but I still can't say with confidence that this is correct. Suggestions?

@jakirkham changed the title from "WIP: ENH: Use syrk to compute certain dot products more quickly and accurately" to "ENH: Use syrk to compute certain dot products more quickly and accurately" on Jan 4, 2016
@jakirkham (Contributor, Author)

Pinging: @njsmith @charris @argriffing

@jakirkham (Contributor, Author)

Also, this probably needs a better way of checking that the arrays are mere transposes of each other, or some flags to suppress complaints about the current strategy if no better approach can be found. (Edit: fixed.)

Review thread on the diff (excerpt):

     * Otherwise, use gemm for all other cases.
     */
    if (
        (ap1 == PyArray_BASE(ap2)) &&
(Member)

Ah, yeah, this check is definitely not right. The base pointer doesn't have any semantics.

I think the correct check would be something like: same data pointer, and shapes are transposes of each other, and one is C contiguous and the other is F contiguous (the Trans checks are probably sufficient for the last part).

@jakirkham (Contributor, Author)

Sorry, I'm not super familiar with the NumPy C-API; this was my first attempt at doing something significant with it. I'll think about how to rework it.

@charris (Member) commented Jan 4, 2016

@jakirkham Trying to figure out when one matrix is the transpose of the other is a bit tricky. Furthermore, these days we tend to design functions to operate on stacks of matrices. On account of that, I think it would be better to have explicit functions that receive a single argument and perform A^T * A or A^H * A explicitly. Since it is often best if the multiplication is enclosed in parentheses anyway, the function also helps with that; i.e., instead of A.T @ B.T @ B @ A one simply writes fun(B @ A). The only problem I have is the name of the function. One would think such a common operation would have a name, but I've never heard of one. Maybe msqr, to complement the matrix square root, or mouter, since it is a form of outer product, or ata and aha.
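
As a sketch of the one-argument helper being floated here (the name ata is one of the candidates above; nothing was settled in the thread):

```python
import numpy as np

def ata(a):
    # Hypothetical one-argument form: computes a.T @ a, which a fast
    # path like this PR's could dispatch to syrk internally.
    a = np.asarray(a)
    return np.dot(a.T, a)

# instead of A.T @ B.T @ B @ A one would write:
#     ata(np.dot(B, A))
```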

Other possibilities are two argument functions for inv(A) @ B @ A or A^T @ B @ A, with B optional. The first form turns up in coordinate transforms, the second in transforming covariances and such. Anyway, I think proposals along those lines would be an excellent topic for discussion. OTOH, it isn't clear to me that the two argument forms buy anything in the way of efficiency or clarity. Hmm..., maybe they aren't such great ideas.

@jakirkham (Contributor, Author)

@charris, regarding the first case (i.e. the one-argument form), I think I still agree with you, which is why I had mentioned something similar earlier ( #6794 (comment) ). That being said, it wasn't and still isn't clear to me how to make traction on that problem. As the initial problem proposed by @njsmith seemed a bit simpler, I tried it here. Adding a new function or functions to do this may be better in the long run.

I am aware of these two-argument forms (including A.H @ B @ A) and have used them before. I think part of the reason that A.T @ A and A.H @ A were brought up is that they can be implemented using the standard BLAS routines syrk and herk. These should be faster than gemm and guarantee symmetry of the result, which I believe was the initial reason this was raised. I am not aware of any BLAS routines that compute these two-argument transforms in one shot on a general matrix. That being said, there may be value in having such a convenience function, but that is a different problem from the one being solved here. Also, given that it would still likely use dot, it could probably just be implemented in Python (see the sketch below).
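
A sketch of what such a Python-level convenience could look like (the name and signature are hypothetical; this is not part of the PR):

```python
import numpy as np

def congruence(b, a):
    # Hypothetical two-argument form from the discussion: a.T @ b @ a,
    # as used when transforming covariances. Built on plain dot, so it
    # can live in Python with no new BLAS plumbing.
    a = np.asarray(a)
    return np.dot(a.T, np.dot(np.asarray(b), a))
```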

@njsmith (Member) commented Jan 4, 2016

R names those operations crossprod(A) = A.T @ A, tcrossprod(A) = A @ A.T. Not sure if those names are worth stealing or not.

I think that's an orthogonal issue though. I don't see any reason not to optimize A @ A.T if that's what someone writes? The stacked matrices thing shouldn't be an issue because that just consists of a loop around the core operation, and the core operation is what we're talking about optimizing. And I don't see any particular challenge in detecting transposes... a cheap sufficient check is just a.data == b.data and a.shape == b.shape[::-1] and a.strides == b.strides[::-1]. (Possibly the strides check can be relaxed further, I'd have to think about it more. But this should catch at least 99% of cases.)
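
Spelled out as a Python-level illustration (the PR implements this check in C; __array_interface__ is used here to compare data pointers, which the .data attribute does not directly expose):

```python
import numpy as np

def is_transpose_of(a, b):
    # The cheap sufficient check described above: same data pointer,
    # reversed shapes, reversed strides.
    return (a.__array_interface__['data'][0] == b.__array_interface__['data'][0]
            and a.shape == b.shape[::-1]
            and a.strides == b.strides[::-1])

a = np.random.rand(3, 5)
assert is_transpose_of(a, a.T)
assert not is_transpose_of(a, a.T.copy())  # the copy owns its own buffer
```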

@jakirkham (Contributor, Author)

> I don't see any reason not to optimize A @ A.T if that's what someone writes?

This I definitely agree with, which is why I took the time to write this code at all. Regarding the NumPy optimization machinery, I believe both of you know more than I do in this area, so I'll let you hash it out. As for the checks, I think I have fixed them, but they should probably be fused into one conditional to reduce rechecking.

@njsmith (Member) commented Jan 4, 2016

Might be good to add tests for the trickier cases: e.g., A @ A[:-1, :].T and A @ A[:, :-1].T have matching data and strides but not shapes, A[::2, :] @ A[:A.shape[0] // 2, :].T has matching data and shapes but not strides, etc.
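
A sketch of tests along these lines (hypothetical, and restricted to the shape-compatible combinations; the reference result uses explicit copies, which cannot hit the fast path):

```python
import numpy as np
from numpy.testing import assert_allclose

def test_dot_near_transpose_views():
    a = np.random.rand(6, 6)
    cases = [
        (a, a[:-1, :].T),                       # same data and strides, shapes differ
        (a[::2, :], a[:a.shape[0] // 2, :].T),  # same data and shapes, strides differ
    ]
    for x, y in cases:
        # The fast path must not fire here; results must match a plain
        # gemm computed on fresh buffers.
        assert_allclose(np.dot(x, y), np.dot(x.copy(), y.copy()))
```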

@jakirkham (Contributor, Author)

OK, sure, that's a good idea. I'll try to work on the tests tomorrow. I've tried to combine the two checks a bit so that they are not repeated for each case.

@jakirkham force-pushed the opt_dot_trans branch 5 times, most recently from cb441e8 to c4dff61 (January 4, 2016)
@jakirkham (Contributor, Author)

@njsmith, I've added the tests you suggested; they check out locally. Travis should be done soon. If there are any more you can think of, please let me know.

@jakirkham (Contributor, Author)

Travis checks out.

@njsmith (Member) commented Jan 6, 2016

This is looking good to me. (@charris: do you still have any objections?)

@jakirkham: One last thing: this is a pretty awesome user-visible feature, so we should advertise it in the release notes. Could you add a short note to doc/release/1.11.0-notes.rst?

@njsmith (Member) commented Jan 6, 2016

...I guess if you're feeling inspired you could also usefully add some benchmarks to the benchmarks/ directory for dot(A, A.T) and functions that use dot(A, A.T), like cov. But I don't really know the details of how this works, and it's not necessary for merging.

@jakirkham (Contributor, Author)

> One last thing: this is a pretty awesome user-visible feature, so we should advertise it in the release notes. Could you add a short note to doc/release/1.11.0-notes.rst?

Absolutely, I added a note under new features. Though I could see it going under improvements. Please let me know if I need to move it or change wording.

> ...I guess if you're feeling inspired you could also usefully add some benchmarks to the benchmarks/ directory for dot(A, A.T) and functions that use dot(A, A.T), like cov. But I don't really know the details of how this works, and it's not necessary for merging.

Sure, I have no problem doing this, but I may need a pointer on how to go about it. In the worst case, I can copy and tweak the existing benchmarks to serve our ends here (see the sketch below for the general shape).
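
For the record, the benchmarks/ directory uses asv (airspeed velocity, the tool behind the numpy-bench site linked later in this thread); a sketch of what such a benchmark could look like (class and method names are hypothetical, and the benchmarks actually added may differ):

```python
import numpy as np

class DotTranspose(object):
    # asv times any method whose name starts with time_.
    def setup(self):
        self.a = np.random.rand(512, 512)

    def time_dot_a_at(self):
        np.dot(self.a, self.a.T)   # should hit the new syrk fast path

    def time_dot_at_a(self):
        np.dot(self.a.T, self.a)

    def time_cov(self):
        np.cov(self.a)             # cov computes a dot involving a transpose
```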

@jakirkham (Contributor, Author)

Updated the benchmarks so that they sort next to each other in the output, which makes them easy to compare.

@jakirkham (Contributor, Author)

Please let me know if there is anything else needed. Otherwise, I am assuming this is complete.

Also, I have tried to rebase against master a few times, but there seem to be a bunch of updates going in. Instead of clogging up the build queue with rebases of this PR, I am going to let it sit. If you would like to merge and want me to rebase beforehand, feel free to give me a nudge and I will do so.

@njsmith (Member) commented Jan 6, 2016

FYI as a general rule, there's no need to rebase at all unless master has changed in a way that breaks your patch, or else you want to rewrite history for some other reason (e.g. fix commit messages or squash commits together).

Anyway, I think this is ready to go -- thanks @jakirkham!

njsmith added a commit that referenced this pull request Jan 6, 2016
ENH: Use `syrk` to compute certain dot products more quickly and accurately
@njsmith merged commit 25c8d1c into numpy:master on Jan 6, 2016
@jakirkham deleted the opt_dot_trans branch on January 7, 2016
@jakirkham (Contributor, Author)

Ah, OK. In previous PRs to numpy I had been encouraged to rebase on master if I was out of date; I was assuming this was to keep the git graph from becoming a giant indecipherable mess. If that is not necessary, I will avoid rebasing strictly to move commits onto master unless explicitly asked.

Cool, thanks for your help in guiding me through this effort, @njsmith. Also, thank you everyone for your suggestions and feedback. If anything else comes up with regard to this, feel free to let me know.

@rgommers (Member) commented Jan 7, 2016

Benchmarks show the 2x improvement now (for a.T and, as expected, not for a.T.copy()): http://pv.github.io/numpy-bench/#bench_linalg.Eindot.time_dot_trans_a_at
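
For context, a minimal illustration (not from the thread) of why the copy cannot benefit: the fast path keys off the two operands sharing one buffer with mirrored shape and strides.

```python
import numpy as np

a = np.random.rand(1000, 1000)

np.dot(a, a.T)         # syrk fast path: operands share one buffer,
                       # with reversed shapes and strides
np.dot(a, a.T.copy())  # plain gemm: the copy lives in its own buffer,
                       # so the transpose check (correctly) fails
```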

@jakirkham (Contributor, Author)

Oh wow, that's super cool. I didn't realize you had an awesome benchmark viewer for NumPy. Also, it is good to see it checks out. Thanks for sharing @rgommers.

@rgommers (Member) commented Jan 7, 2016

Yes, we should advertise that better. I guess @pv didn't advertise it because it's on his personal site for now. But it's pretty cool indeed.

@rgommers added this to the 1.11.0 release milestone on Jan 7, 2016
@jakirkham (Contributor, Author)

For those who plan to use this on Mac with OpenBLAS, there is something to be aware of (OpenMathLib/OpenBLAS#730). This does not affect Linux.

@jakirkham (Contributor, Author)

Turns out this actually optimizes numpy.matmul as well, though we didn't notice that at the time. A benchmark for numpy.matmul was added in PR #7034.
