
ENH: Use syrk to compute certain dot products more quickly and accurately #6932

Merged: 4 commits into numpy:master from the opt_dot_trans branch on Jan 6, 2016

Conversation

@jakirkham (Contributor)

Fixes #6794

Related: #6930

Handles the special cases of a.T @ a and a @ a.T by using syrk instead of gemm.
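
For readers skimming the thread: syrk is the BLAS symmetric rank-k update, and it writes only one triangle of the result, which is where both the speed and the guaranteed symmetry come from. A minimal sketch of what it computes, using SciPy's BLAS wrappers purely for illustration (the PR itself calls BLAS from C):

```python
import numpy as np
from scipy.linalg import blas  # SciPy used here for illustration only

a = np.random.rand(3, 4)

# dsyrk fills only one triangle of the symmetric product a @ a.T
c = blas.dsyrk(1.0, a)               # upper triangle of a @ a.T
full = np.triu(c) + np.triu(c, 1).T  # mirror to recover the full matrix

assert np.allclose(full, np.dot(a, a.T))
```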

@jakirkham (Contributor, Author)

I think something might be wrong with the ld* parameter I am supplying here. If somebody who knows a bit more about how this works in the gemm case can explain it to me, maybe I can get it fixed up.
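
For context on the ld* parameters (an aside, not from the thread): in Fortran BLAS, the leading dimension is the element stride between consecutive columns of the stored matrix, which for a NumPy array can be read off its strides. A rough sketch:

```python
import numpy as np

a = np.ones((5, 7))  # C-contiguous float64

# Seen through column-major BLAS, a C-contiguous (m, n) array is an
# (n, m) Fortran matrix whose leading dimension is the row stride
# measured in elements:
lda = a.strides[0] // a.itemsize
assert lda == 7
```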

Also, I have tried to detect whether an array is a transpose view using an ad hoc mechanism I came up with. If somebody has a better way to check this, I am open to suggestions.

@jakirkham (Contributor, Author)

I tried building this with OpenBLAS locally and am not seeing the same error. Also, I am seeing an appreciable speed-up on transposed arrays (~2x faster).

@jakirkham (Contributor, Author)

OK, it no longer seems to be complaining about the ld* values, but I still can't say with confidence that this is correct. Suggestions?

@jakirkham changed the title from "WIP: ENH: Use syrk to compute certain dot products more quickly and accurately" to "ENH: Use syrk to compute certain dot products more quickly and accurately" on Jan 4, 2016
@jakirkham (Contributor, Author)

Pinging: @njsmith @charris @argriffing

@jakirkham (Contributor, Author)

Also, this probably needs a better way of checking that the arrays are mere transposes of each other, or some flags to suppress complaints about the current strategy if no better approach can be found. (Edit: fixed.)

Review thread on the diff (excerpt):

     * Otherwise, use gemm for all other cases.
     */
    if (
        (ap1 == PyArray_BASE(ap2)) &&
(Member)

Ah, yeah, this check is definitely not right. The base pointer doesn't have any semantics.

I think the correct check would be something like: same data pointer, and shapes are transposes of each other, and one is C contiguous and the other is F contiguous (the Trans checks are probably sufficient for the last part).

@jakirkham (Contributor, Author)

Sorry, I'm not super familiar with the NumPy C-API; this was my first attempt at doing something significant with it. I'll think about how to rework it.

@charris (Member) commented Jan 4, 2016

@jakirkham Trying to figure out when one matrix is the transpose of the other is a bit tricky. Furthermore, these days we tend to design functions to operate on stacks of matrices. On account of that, I think it would be better to have explicit functions that receive a single argument and perform A^T * A or A^H * A explicitly. Since it is often best if the multiplication is enclosed in parentheses anyway, the function also helps with that; i.e., instead of A.T @ B.T @ B @ A one simply writes fun(B @ A). The only problem I have is the name of the function. One would think such a common operation would have a name, but I've never heard of one. Maybe msqr, to complement the matrix square root, or mouter, since it is a form of outer product, or ata and aha.
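
As a sketch of the one-argument helper being floated here (the name ata is one of the candidates above; nothing was settled in the thread):

```python
import numpy as np

def ata(a):
    # Hypothetical one-argument form: computes a.T @ a, which a fast
    # path like this PR's could dispatch to syrk internally.
    a = np.asarray(a)
    return np.dot(a.T, a)

# instead of A.T @ B.T @ B @ A one would write:
#     ata(np.dot(B, A))
```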

Other possibilities are two argument functions for inv(A) @ B @ A or A^T @ B @ A, with B optional. The first form turns up in coordinate transforms, the second in transforming covariances and such. Anyway, I think proposals along those lines would be an excellent topic for discussion. OTOH, it isn't clear to me that the two argument forms buy anything in the way of efficiency or clarity. Hmm..., maybe they aren't such great ideas.

@jakirkham (Contributor, Author)

@charris, regarding the first case (i.e. the one-argument form), I think I still agree with you, which is why I had mentioned something similar earlier ( #6794 (comment) ). That being said, it wasn't and still isn't clear to me how to make traction on that problem. As the initial problem proposed by @njsmith seemed a bit simpler, I tried it here. Adding a new function or functions to do this may be better in the long run.

I am aware of these two-argument forms (including A.H @ B @ A) and have used them before. I think part of the reason that A.T @ A and A.H @ A were brought up is that they can be implemented using the standard BLAS routines syrk and herk. These should be faster than gemm and guarantee symmetry of the result, which I believe was the initial reason this was raised. I am not aware of any BLAS routines that compute these two-argument transforms in one shot on a general matrix. That being said, there may be value in having such a convenience function, but that is a different problem from the one being solved here. Also, given that it would still likely use dot, it could probably just be implemented in Python (see the sketch below).
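
A sketch of what such a Python-level convenience could look like (the name and signature are hypothetical; this is not part of the PR):

```python
import numpy as np

def congruence(b, a):
    # Hypothetical two-argument form from the discussion: a.T @ b @ a,
    # as used when transforming covariances. Built on plain dot, so it
    # can live in Python with no new BLAS plumbing.
    a = np.asarray(a)
    return np.dot(a.T, np.dot(np.asarray(b), a))
```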

@njsmith (Member) commented Jan 4, 2016

R names those operations crossprod(A) = A.T @ A, tcrossprod(A) = A @ A.T. Not sure if those names are worth stealing or not.

I think that's an orthogonal issue though. I don't see any reason not to optimize A @ A.T if that's what someone writes? The stacked matrices thing shouldn't be an issue because that just consists of a loop around the core operation, and the core operation is what we're talking about optimizing. And I don't see any particular challenge in detecting transposes... a cheap sufficient check is just a.data == b.data and a.shape == b.shape[::-1] and a.strides == b.strides[::-1]. (Possibly the strides check can be relaxed further, I'd have to think about it more. But this should catch at least 99% of cases.)
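
Spelled out as a Python-level illustration (the PR implements this check in C; __array_interface__ is used here to compare data pointers, which the .data attribute does not directly expose):

```python
import numpy as np

def is_transpose_of(a, b):
    # The cheap sufficient check described above: same data pointer,
    # reversed shapes, reversed strides.
    return (a.__array_interface__['data'][0] == b.__array_interface__['data'][0]
            and a.shape == b.shape[::-1]
            and a.strides == b.strides[::-1])

a = np.random.rand(3, 5)
assert is_transpose_of(a, a.T)
assert not is_transpose_of(a, a.T.copy())  # the copy owns its own buffer
```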

@jakirkham (Contributor, Author)

> I don't see any reason not to optimize A @ A.T if that's what someone writes?

This I definitely agree with, which is why I took the time to write this code at all. Regarding the NumPy optimization machinery, I believe both of you know more than I do in this area, so I'll let you hash it out. As for the checks, I think I have fixed them, but they should probably be fused into one conditional to reduce rechecking.

@njsmith (Member) commented Jan 4, 2016

Might be good to add tests for the trickier cases: e.g., A @ A[:-1, :].T and A @ A[:, :-1].T have matching data and strides but not shapes, A[::2, :] @ A[:A.shape[0] // 2, :].T has matching data and shapes but not strides, etc.
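
A sketch of tests along these lines (hypothetical, and restricted to the shape-compatible combinations; the reference result uses explicit copies, which cannot hit the fast path):

```python
import numpy as np
from numpy.testing import assert_allclose

def test_dot_near_transpose_views():
    a = np.random.rand(6, 6)
    cases = [
        (a, a[:-1, :].T),                       # same data and strides, shapes differ
        (a[::2, :], a[:a.shape[0] // 2, :].T),  # same data and shapes, strides differ
    ]
    for x, y in cases:
        # The fast path must not fire here; results must match a plain
        # gemm computed on fresh buffers.
        assert_allclose(np.dot(x, y), np.dot(x.copy(), y.copy()))
```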

@jakirkham (Contributor, Author)

OK, sure, that's a good idea. I'll try to work on the tests tomorrow. I've tried to combine the two checks a bit so that they are not repeated for each case.

@jakirkham force-pushed the opt_dot_trans branch 5 times, most recently from cb441e8 to c4dff61 (January 4, 2016)
@jakirkham (Contributor, Author)

@njsmith, I've added the tests you suggested; they check out locally. Travis should be done soon. If there are any more you can think of, please let me know.

@jakirkham (Contributor, Author)

Travis checks out.

@njsmith (Member) commented Jan 6, 2016

This is looking good to me. (@charris: do you still have any objections?)

@jakirkham: One last thing: this is a pretty awesome user-visible feature, so we should advertise it in the release notes. Could you add a short note to doc/release/1.11.0-notes.rst?

@njsmith (Member) commented Jan 6, 2016

...I guess if you're feeling inspired you could also usefully add some benchmarks to the benchmarks/ directory for dot(A, A.T) and functions that use dot(A, A.T), like cov. But I don't really know the details of how this works, and it's not necessary for merging.

@jakirkham (Contributor, Author)

> One last thing: this is a pretty awesome user-visible feature, so we should advertise it in the release notes. Could you add a short note to doc/release/1.11.0-notes.rst?

Absolutely, I added a note under new features. Though I could see it going under improvements. Please let me know if I need to move it or change wording.

> ...I guess if you're feeling inspired you could also usefully add some benchmarks to the benchmarks/ directory for dot(A, A.T) and functions that use dot(A, A.T), like cov. But I don't really know the details of how this works, and it's not necessary for merging.

Sure, I have no problem doing this, but I may need a pointer on how to go about it. In the worst case, I can copy and tweak the existing benchmarks to serve our ends here (see the sketch below for the general shape).
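
For the record, the benchmarks/ directory uses asv (airspeed velocity, the tool behind the numpy-bench site linked later in this thread); a sketch of what such a benchmark could look like (class and method names are hypothetical, and the benchmarks actually added may differ):

```python
import numpy as np

class DotTranspose(object):
    # asv times any method whose name starts with time_.
    def setup(self):
        self.a = np.random.rand(512, 512)

    def time_dot_a_at(self):
        np.dot(self.a, self.a.T)   # should hit the new syrk fast path

    def time_dot_at_a(self):
        np.dot(self.a.T, self.a)

    def time_cov(self):
        np.cov(self.a)             # cov computes a dot involving a transpose
```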

@jakirkham (Contributor, Author)

Updated the benchmarks so that they sort next to each other in the output, which makes them easy to compare.

@jakirkham (Contributor, Author)

Please let me know if there is anything else needed. Otherwise, I am assuming this is complete.

Also, I have tried to rebase against master a few times, but there seem to be a bunch of updates going in. Instead of clogging up the build queue with rebases of this PR, I am going to let it sit. If you would like to merge and want me to rebase beforehand, feel free to give me a nudge and I will do so.

@njsmith (Member) commented Jan 6, 2016

FYI as a general rule, there's no need to rebase at all unless master has changed in a way that breaks your patch, or else you want to rewrite history for some other reason (e.g. fix commit messages or squash commits together).

Anyway, I think this is ready to go -- thanks @jakirkham!

njsmith added a commit that referenced this pull request Jan 6, 2016
ENH: Use `syrk` to compute certain dot products more quickly and accurately
@njsmith merged commit 25c8d1c into numpy:master on Jan 6, 2016
@jakirkham deleted the opt_dot_trans branch on January 7, 2016
@jakirkham (Contributor, Author)

Ah, OK. In previous PRs to numpy I had been encouraged to rebase on master if I was out of date; I was assuming this was to keep the git graph from becoming a giant indecipherable mess. If that is not necessary, I will avoid rebasing strictly to move commits onto master unless explicitly asked.

Cool, thanks for your help in guiding me through this effort, @njsmith. Also, thank you everyone for your suggestions and feedback. If anything else comes up with regard to this, feel free to let me know.

@rgommers (Member) commented Jan 7, 2016

Benchmarks show the 2x improvement now (for a.T and, as expected, not for a.T.copy()): http://pv.github.io/numpy-bench/#bench_linalg.Eindot.time_dot_trans_a_at
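
For context, a minimal illustration (not from the thread) of why the copy cannot benefit: the fast path keys off the two operands sharing one buffer with mirrored shape and strides.

```python
import numpy as np

a = np.random.rand(1000, 1000)

np.dot(a, a.T)         # syrk fast path: operands share one buffer,
                       # with reversed shapes and strides
np.dot(a, a.T.copy())  # plain gemm: the copy lives in its own buffer,
                       # so the transpose check (correctly) fails
```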

@jakirkham (Contributor, Author)

Oh wow, that's super cool. I didn't realize you had an awesome benchmark viewer for NumPy. Also, it is good to see it checks out. Thanks for sharing @rgommers.

@rgommers (Member) commented Jan 7, 2016

Yes, we should advertise that better. I guess @pv didn't advertise it because it's on his personal site for now. But it's pretty cool indeed.

@rgommers added this to the 1.11.0 release milestone on Jan 7, 2016
@jakirkham (Contributor, Author)

For those who plan to use this on Mac with OpenBLAS, there is something to be aware of (OpenMathLib/OpenBLAS#730). This does not affect Linux.

@jakirkham (Contributor, Author)

Turns out this actually optimizes numpy.matmul as well, though we didn't notice that at the time. A benchmark for numpy.matmul was added in PR #7034.
