ENH: add contract: optimizing numpy's einsum expression #5488

Conversation
5 False bcde,fb->cdef gc,hd,cdef->efgh
5 False cdef,gc->defg hd,defg->efgh
5 False defg,hd->efgh efgh->efgh
>>> ein_result = np.einsum('ea,fb,abcd,gc,hd->efgh', C, C, I, C, C, path=opt_path[0])
There is no line calculating `opt_result`, and this one computing `ein_result` is passing a wrong keyword argument `pat` to `np.einsum`.
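For reference, a minimal sketch of the corrected call as it works with the `einsum_path`/`optimize` interface that NumPy eventually shipped (the draft under review used a different keyword; the 4-dimensional shapes here are illustrative assumptions):

```python
import numpy as np

# Illustrative operands matching the docstring example's C and I names.
C = np.random.rand(4, 4)
I = np.random.rand(4, 4, 4, 4)

# Compute a contraction path once, then hand it back to einsum.
# opt_path[0] is the machine-readable path, opt_path[1] a printable report.
opt_path = np.einsum_path('ea,fb,abcd,gc,hd->efgh', C, C, I, C, C,
                          optimize='greedy')
opt_result = np.einsum('ea,fb,abcd,gc,hd->efgh', C, C, I, C, C,
                       optimize=opt_path[0])

# Sanity check against the unoptimized contraction.
naive_result = np.einsum('ea,fb,abcd,gc,hd->efgh', C, C, I, C, C,
                         optimize=False)
assert np.allclose(opt_result, naive_result)
```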
For the opt_path option, it would be nice to return a machine readable version, i.e., something that I could paste in to substitute for it. My vote is also for a name like …
@shoyer

# Regular einsum
np.einsum('aab,bc', op1, op2)
# Machine readable path return
np.dot(np.einsum('aab->ab', op1), op2)

This can become exceedingly long for the more complicated paths. I have written cache-like algorithms for my own usage of this type of script; I will look into adding them here. I was thinking …
Shouldn't the name be "einsum"? Names are for users, and users care about …
@njsmith I was starting to think along the same lines. The main issues that I see: …
Yes, of course. If performance is a concern, my suggestion would be to write this in Cython, not C. That, along with caching, should take care of any performance concerns. It doesn't particularly matter to me what form the path is returned in, as long as it can be used as an argument to the function to skip the optimization step. Generating Python code is probably not a path we want to go down.
+1 as long as it doesn't break compatibility. With the long-term prospect of moving more stuff down into multiarray, we might want to look into reorganizing that directory to separate out large, semi-standalone parts like this.
I have been fairly busy this week, but I wanted to drop in and say that I am looking into this. I really like pushing everything into the …

@charris Can you expand a bit on reorganizing the …?

@shoyer Writing this in Cython would be much easier; is the C conversion done with the cythonize script? With the caching, we would need to do a thread-safe implementation in case someone runs a threaded …

We may have missed each other at some point on the path saving functionality. Path saving is currently implemented and shown in the documentation. Is there anything wrong with the way it is done currently? I probably need to run this through the mailing list again to solicit feedback.

Overall, I think most of the major implementation questions have been answered.
@dgasmith I'm not entirely sure how the Cython build process is done in NumPy, but I know it is done (the …).

This is actually a case where the Python GIL makes things easy -- by default, everything in Python is thread safe. You simply would not release the GIL for the path optimization step.

Yes, I do see that path saving is currently implemented. This is probably fine, though possibly even unnecessary if we get transparent caching going.
Unfortunately, using Cython here won't be as trivial as one might think. The …

Two possibilities for enabling Cython usage:
It seems just writing this in C would be the easiest route. The largest issue that I see is losing out on set arithmetic. Given the limited number of valid einsum indices (48), I would guess a bit-array set implementation would be the easiest and likely very fast. Although, I have not thought about this before. Does anyone have an opinion before I dig into it?
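A hypothetical sketch of the bit-array idea in Python, just to make the proposal concrete: each index set becomes a bit mask in a plain integer, and the set arithmetic the optimizer needs maps directly onto bitwise operations (all names here are invented for illustration):

```python
import string

# The 48 valid einsum subscript characters fit comfortably in one integer.
LETTERS = string.ascii_letters

def to_mask(subscripts):
    """Encode a subscript string like 'abc' as an integer bit mask."""
    mask = 0
    for ch in subscripts:
        mask |= 1 << LETTERS.index(ch)
    return mask

a = to_mask('abc')
b = to_mask('bcd')

union = a | b     # indices appearing in either operand
shared = a & b    # shared (potentially contracted) indices
only_a = a & ~b   # indices unique to the first operand

assert union == to_mask('abcd')
assert shared == to_mask('bc')
assert only_a == to_mask('a')
```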
A set of real-world use cases where this is too slow would be useful first, to better judge what exactly needs to be sped up. Then we can decide how to do it. For Cython we could simply concatenate a bunch of files before running Cython to get around the duplication issue.
From a brief look at the code, it does not look like code that would profit much from using Cython; it definitely needs to be profiled before deciding on the way to go. Unless there is one really obvious bottleneck that needs typing, it looks like it should stay Python or be really low-level C using special-purpose data structures instead of generic Python ones.
Is there a straightforward way to use internal routines written in C for einsum (e.g., for argument parsing) on the Python side? That was part of the reason why I was thinking it might be desirable to do this from Cython/C.
In the …

As for writing stuff in Python first, right now …
If next up is dropping this into …
I would prefer a dictionary over a tuple for passing information, but either is fine. My main concern would be ensuring that reference counting is still handled correctly. If this sounds fine, I can try to write this out.
That sounds fine. Or instead of a dict/tuple (which are relatively …

But I haven't actually looked at the code, so for all I know it's already …
On a quick looksie, this looks sort of, but not quite, ready. In particular, there is the question of whether it should be an improved einsum, which requires complete compatibility. I'm tempted to put this off to NumPy 1.11.0 unless it becomes active again.
@charris The code that actually does the optimization is fairly close to done. Where I keep running into issues is trying to take advantage of the current C parsing. We only really need the einsum string returned to the Python side; unfortunately, part of the parsing is written in the primary einsum code and is not easily extractable. I am not quite sure I can get the information back out without making fairly substantial changes to the code, which is not something I would really like to do.

The other option is for optimized einsum contractions to have their own parsing on the Python side. If you need to worry about optimization, you definitely will not notice the extra overhead. Although, it would create two pieces of code that do the same thing.
Yeah, the einsum C parsing is a handful. I don't mind doing parsing on the Python side. I guess the question is: do we add a new function, or maybe a keyword? I wouldn't mind a bit of help here from the others who have looked at the code: @shoyer @njsmith @juliantaylor @jaimefrio.
From the user point of view, I really think optimization should be as …
@njsmith I was originally planning to leave optimization off by default. It is possible that turning optimization on would break some …
I started working on merging this into the einsum function with Python-side parsing: …

Thinking on this a bit more, all einsum kwargs besides …
It's fine if einsum is faster in simple cases and slower if more complex …
@dgasmith Keep us informed of progress. I might put this off to NumPy 1.11.
@charris The complete overhead for computing a path (parsing the input, finding the path, and organizing that data) with default options is about 150us. As a reference, einsum takes a minimum of 5-10us to call, so the worst-case scenario would be that the optimization overhead makes einsum 30x slower. Personally, I'd go for turning optimization off by default and then revisiting if someone tackles the parsing issue to reduce the overhead.

For my first question, I added a function called …

Should I revert the add_doc documentation, but leave the documentation in …?

I don't really have an opinion on most documentation/filename issues. Just tell me what needs fixing.
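A rough sketch of how call-overhead numbers like these can be measured (exact figures are machine dependent, so none are claimed here):

```python
import timeit
import numpy as np

a = np.random.rand(2, 2)

# A tiny contraction is dominated by call/setup cost, so comparing the two
# timings approximates the parsing-plus-path overhead discussed above.
t_plain = timeit.timeit(lambda: np.einsum('ij,jk', a, a), number=1000)
t_opt = timeit.timeit(lambda: np.einsum('ij,jk', a, a, optimize=True),
                      number=1000)
print('plain: %.6fs  optimize=True: %.6fs' % (t_plain, t_opt))
```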
Given the call overhead, I agree that turning off optimization by default is the way to go.

Hmm, the examples …

I am not a fan of multiple output types depending on passed parameters. It seems simple enough to call …

@shoyer @njsmith @seberg This is getting close to a merge; I'd welcome any final input at this point.
Note that at some future date it might make sense to call …
If it is very simple, another choice might be to default to no optimization based on some simple heuristic such as the input size. In that case it could be even neater to call back into Python for optimization instead of the other way around. However, I guess it should not make a big difference.

The einsum path information seems useful if you have a bigger problem and want to know how to approach it. I could also imagine just making it …

Have to look more at this to give real input though; will try to do that later.
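As a sketch of that "bigger problem" use case: `np.einsum_path` returns both a reusable path and a printable report, which is handy for inspecting how a large contraction will be approached before committing to it (the shapes below are arbitrary):

```python
import numpy as np

a = np.random.rand(10, 10, 10)
b = np.random.rand(10, 10)
c = np.random.rand(10, 10)

# The second return value is a human-readable summary: the complete
# contraction, naive vs optimized FLOP counts, and the per-step list.
path, report = np.einsum_path('ijk,jl,kl->il', a, b, c, optimize='greedy')
print(report)
```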
Also accepts an explicit contraction list from the ``np.einsum_path``
function.
See ``np.einsum_path`` for more details.
memory_limit : int, optional
A parameter that only has an effect if another parameter is set is not very nice. Maybe accepting a tuple for optimize, with the first element being the optimize value and the second the memory limit, is better?
Sure, we can do that. I'd like to make it so that someone can simply say optimize=True as well. The default memory_limit is pretty reasonable and is fine for 99% of cases, I think.
@seberg The biggest problem I've had with simple heuristics is that parsing the input Python-side is quite slow by itself (30-100us); at that point, computing the path isn't a huge overhead. Parsing C-side is really quite beneficial, but disentangling einsum from its parsing C-side seemed like a fairly large project. One possibility would be to add a series of heuristics that check at different parsing stages whether you should fall back to c_einsum or not. This has always seemed more complex than it's worth at this stage.

@juliantaylor I liked the idea of removing memory_limit as a separate keyword, so I rearranged that a bit. The optimize keyword is getting a bit crowded; let me know if you see any issues with the current setup.

@charris Moving c_einsum around a bit looks good to me. Would we add the full documentation back to c_einsum, or only a part of it?
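A small sketch of the folded-in form discussed above, as it ended up in the released API: the memory bound rides along inside the optimize argument as a (mode, limit) tuple rather than being a separate keyword (shapes here are arbitrary):

```python
import numpy as np

a, b, c = (np.random.rand(8, 8) for _ in range(3))

# The second tuple element caps the number of elements in any intermediate
# array the optimizer may build (here: at most one 8x8 temporary).
path, report = np.einsum_path('ij,jk,kl->il', a, b, c,
                              optimize=('greedy', 64))
result = np.einsum('ij,jk,kl->il', a, b, c, optimize=path)
assert np.allclose(result, a @ b @ c)
```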
@dgasmith Let's rename …

EDIT: At some point we might want to move bits around, but I think we should be able to do that without causing problems. AFAICT, no one currently uses …
@dgasmith I think we can put this in once the remaining details are taken care of.
☔ The latest upstream changes (presumably #7980) made this pull request unmergeable. Please resolve the merge conflicts.
@@ -0,0 +1,979 @@
from numpy.core.multiarray import einsum as c_einsum |
This needs some module documentation and the standard lead-in:
"""
Implementation of optimized einsum.
"""
from __future__ import division, absolute_import, print_function
overall_contraction = input_subscripts + "->" + output_subscript
header = ("scaling", "current", "remaining")

speedup = naive_cost / float(opt_cost)
With Python 3 division (see above), the float is not needed.
def einsum(*operands, **kwargs):
    """
    einsum(subscripts, *operands, out=None, dtype=None, order='K',
           casting='safe', optimize=True, memory_limit=None)
The optimize default is False, not True.
memory_limit seems to be gone.
Specifies the subscripts for summation.
operands : list of array_like
    These are the arrays for the operation.
out : ndarray, optional
{ndarray, None}
These are the arrays for the operation.
out : ndarray, optional
    If provided, the calculation is done into this array.
dtype : data-type, optional
This should be {data-type, None}.
These are the arrays for the operation.
out : ndarray, optional
    If provided, the calculation is done into this array.
dtype : data-type, optional
Also needs the default value documented.
* 'safe' means only casts which can preserve values are allowed.
* 'same_kind' means only safe casts or casts within a kind,
  like float64 to float32, are allowed.
* 'unsafe' means any data conversions may be done.
Needs the default documented.
@charris I could not find another place where casting had a default or where there was extra text after a list, so I hope the following is parsed correctly. Should the default be added to other casting descriptions as well?
The default is 'safe', implemented in array_einsum. Not sure what to say for the dtype argument; maybe the smallest common type compatible with safe casting.
We try to document the default values of all optional arguments, although we evidently haven't been perfect about that. So the answer to the last question is "yes".
In my opinion it's actually a cleaner style to document default values only in function signatures. Otherwise, it's easy to get out of sync.
If we added defaults to the text for dtype, it would probably be best to add them for all signatures. Might be simpler to skip the defaults for now and open a new issue to decide either way, so that it can be applied evenly.
Controls if intermediate optimization should occur. No optimization
will occur if False and True will default to the 'greedy' algorithm.
Also accepts an explicit contraction list from the ``np.einsum_path``
function.
Needs default documented. I wouldn't put See ... on a separate line.
What happened to memory_limit? It is still part of the function signature.
I popped the memory_limit argument as @juliantaylor recommended. I updated the documentation, but didn't take out memory_limit in other places. I'll fix this up as well.
The conflict is in the release notes. Three things that need fixing/checking: …

Changing my previous suggestion: … If you don't have time for this, let me know and I will fix up the few remaining nits.
@charris I'll clean up the remaining issues Monday morning. Thanks for looking through this!
Force-pushed 83e9433 to 9c8ad81 (compare).
Force-pushed 9c8ad81 to 4c788f4 (compare).
Returns
-------
output : ndarray
    The calculation based on the Einstein summation convention.

See Also
--------
dot, inner, outer, tensordot
einsum, dot, inner, outer, tensordot

Notes
-----
IIRC, the examples need changing, but I can do that later. I expect this function might be changing a bit down the line in any case.
OK, let's get this in before the release notes change again. Thanks @dgasmith.
@charris Great, thanks for getting this merged in! Please ping me for any required fixes. I will probably have some tweaks to the algorithms down the line, and I'll probably look into adding BLAS calls back as well.
This is awesome, thanks @dgasmith for all your hard work on this one!
Greetings everyone,
NumPy's einsum function can compute arbitrary expressions; however, there are two drawbacks to using pure einsum: einsum does not consider building intermediate arrays for possible reductions in overall rank, and it is not currently capable of using a vendor BLAS. The contract function aims to solve these issues and effectively sets out to answer the question: "How fast can I execute an Einstein summation expression in NumPy?" The answer turns out to be quite fast, often only 2-3 times slower than an optimized C or Fortran expression, especially when a vendor BLAS is utilized. In addition, multi-argument einsum speed issues (like #5366) should be solved and, when BLAS functionality is added, contract can reproduce multi_dot-like behavior (#4977).

The original repository has additional information that I have not been able to cram into the current documentation. There are additional test scripts that provide random expression testing, single-expression debugging, and path comparisons, in addition to evaluate timing. Other branches on this repo have both BLAS and ellipses-in-the-subscript functionality.
Following a few recent PRs, I have decided to break this up into three major commits: …

I think this will help split up some of the complexity, and the first PR is already large enough.