ENH: speed-up of triangular matrix functions #4509
Conversation
if invert:
    m = less.outer(arange(N), arange(-k, M-k))
else:
    m = greater_equal.outer(arange(N), arange(-k, M-k))
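For context, a minimal sketch of what these ufunc `.outer` calls compute (the small shapes `N=4, M=5, k=0` are chosen here just for illustration):

```python
import numpy as np

N, M, k = 4, 5, 0
# greater_equal.outer(arange(N), arange(-k, M-k)) gives a boolean mask where
# element (i, j) is True when i >= j - k, i.e. the lower triangle shifted by k.
m = np.greater_equal.outer(np.arange(N), np.arange(-k, M - k))
print(m.astype(int))
```

The `invert` branch simply flips the comparison to `less.outer`, selecting the strict upper triangle instead.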
premature optimization alert!
I wonder if this would be faster if the arrays were float32 instead of integers: the float comparison loops are vectorized while the integer ones are not (yet). It would probably need range checks to avoid rounding issues.
Yes, it would be almost 2x faster:
In [180]: %timeit np.less.outer(np.arange(1000), np.arange(1000))
1000 loops, best of 3: 1.19 ms per loop
In [181]: %timeit np.less.outer(np.arange(1000, dtype=np.float32), np.arange(1000, dtype=np.float32))
1000 loops, best of 3: 676 µs per loop
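The rounding caveat mentioned above can be checked directly: float32 represents integers exactly only up to 2**24, so for index ranges below that the float and integer masks agree (a quick sanity sketch, not part of the patch):

```python
import numpy as np

n = 1000
ints = np.arange(n)
floats = np.arange(n, dtype=np.float32)

# float32 has a 24-bit significand, so every integer below 2**24 is exact;
# within that range the comparison results match the integer version.
assert np.array_equal(np.less.outer(ints, ints),
                      np.less.outer(floats, floats))

# Beyond 2**24, adjacent integers collapse to the same float32 value.
print(np.float32(2**24) == np.float32(2**24 + 1))
```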
I guess I need to get to vectorizing integers :)
That would be nice to have for sure. But as you mentioned elsewhere this is already on the verge of optimization for the sake of optimization. Using float32 aranges to speed the comparison up will push it all the way to "root of all evil" in Knuth's words.
Looks interesting. The changes seem simple enough to be added even without evidence that we need them, but I am curious whether these functions really are bottlenecks in real applications?
I looked into this after seeing a question on Stack Overflow. It is a little appalling that the indexing turns out to be the slowest operation in that kind of calculation.
There are changes to the public API involved. Nothing too big, but they are there. Should I try to get some feedback from the main list? If yes, I would rather wait until the flames of the
@@ -757,17 +773,24 @@ def mask_indices(n, mask_func, k=0):
     return where(a != 0)


-def tril_indices(n, k=0):
+def tril_indices(n, k=0, m=None):
     """
     Return the indices for the lower-triangle of an (n, n) array.
this is now an (n, m) array I assume
Oops, sloppy me. Same thing for triu_indices as well.
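To make the docstring fix concrete, a short usage sketch of the extended signature from the diff above, `tril_indices(n, k=0, m=None)`, applied to a rectangular (n, m) array:

```python
import numpy as np

# With the new optional `m` argument, the indices cover an (n, m) array
# rather than requiring a square one.
rows, cols = np.tril_indices(3, k=0, m=5)

a = np.arange(15).reshape(3, 5)
print(a[rows, cols])  # the lower-triangle elements of the 3x5 array
```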
@jaimefrio Now might be a good time.
He already sent a message to the list.
* `np.tri` now produces fewer intermediate arrays. Runs about 40% faster for general dtypes, up to 3x faster for boolean arrays.
* `np.tril` now does smarter type conversions (thanks Julian!), and together with the improvements in `np.tri` now runs about 30% faster. `np.triu` runs almost 2x faster than before, but still runs 20% slower than `np.tril`, which is an improvement over the 50% difference before.
* `np.triu_indices` and `np.tril_indices` no longer call `np.mask_indices`; instead they call `np.where` directly on a boolean array created with `np.tri`. They now run roughly 2x faster.
* Removed the constraint that the array be square in calls to `np.triu_indices`, `np.tril_indices`, `np.triu_indices_from` and `np.tril_indices_from`.
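The third bullet can be illustrated directly: per the commit message, the indices functions now boil down to `np.where` on a boolean mask built by `np.tri` (a sketch of the equivalence, not the literal source):

```python
import numpy as np

n, m, k = 4, 4, 1
# Boolean lower-triangle mask including the k-th superdiagonal.
mask = np.tri(n, m, k, dtype=bool)
rows, cols = np.where(mask)

# This matches what np.tril_indices returns for the same parameters.
ref = np.tril_indices(n, k, m)
print(np.array_equal(rows, ref[0]) and np.array_equal(cols, ref[1]))
```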
Ah, good idea! Letting
The difference is only 10% on my machine; it does not seem relevant enough for new API. By the way, in this branch I have integer compare vectorization: https://github.com/juliantaylor/numpy/tree/int-vectorize-compiler. It gains about 20% if the arange fits into a short, even though the compiler vectorization is far from optimal. Might be worth revisiting this if we merge that branch.
At least on Linux, creating an array with zeros and then using copyto(where=tri) is even faster, and it has the advantage that the matrix is partially sparse if it is really large (rows > page size).
I did look into something along the lines of what you propose. Something like the following:
was faster when
and that killed performance big time. Not sure if using copyto avoids this. Or maybe we should special-case 2D arrays to use boolean mask assignment and keep higher dimensions doing multiplication.
Hm, why would that kill performance? While the sparseness would be nice, the inputs would need to be pretty big to actually be able to use it, and the behavior would be inconsistent between platforms anyway. So this is good enough, thanks. Merging.
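For reference, a 2-D sketch of the copyto-based variant discussed above (the helper name `tril_copyto` is hypothetical, and the real `np.tril` also handles higher-dimensional input):

```python
import numpy as np

def tril_copyto(a, k=0):
    # Start from zeros, then copy only the lower-triangle elements selected
    # by a boolean np.tri mask; untouched pages of a large output can stay
    # sparse on platforms with zero-fill allocation.
    mask = np.tri(a.shape[0], a.shape[1], k, dtype=bool)
    out = np.zeros_like(a)
    np.copyto(out, a, where=mask)
    return out

a = np.arange(16).reshape(4, 4)
print(np.array_equal(tril_copyto(a), np.tril(a)))
```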
Not sure why, but it is easy to test: