
ENH: speed-up of triangular matrix functions #4509

Merged
1 commit merged into numpy:master on Mar 26, 2014

Conversation

jaimefrio
Member

  • np.tri now produces fewer intermediate arrays. Runs about 40% faster for
    general dtypes, up to 3x faster for boolean arrays.
  • np.tril now does smarter type conversions (thanks Julian!), and together
    with the improvements in np.tri now runs about 30% faster. np.triu
    runs almost 2x faster than before, but still runs 20% slower than
    np.tril, which is an improvement over the 50% difference before.
  • np.triu_indices and np.tril_indices do not call np.mask_indices,
    instead they call np.where directly on a boolean array created with
    np.tri. They now run roughly 2x faster.
  • Removed the constraint for the array to be square in calls to
    np.triu_indices, np.tril_indices, np.triu_indices_from and
    np.tril_indices_from.
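The last bullet relaxes the square constraint; a minimal usage sketch (assuming the new optional `m` column-count parameter that this PR adds to `np.tril_indices`, visible in the diff below):

```python
import numpy as np

# Rectangular support added in this PR: tril_indices now takes an optional
# column count m (previously the indexed array had to be square).
rows, cols = np.tril_indices(3, k=0, m=5)   # lower triangle of a 3x5 array

a = np.arange(15).reshape(3, 5)
lower = np.zeros_like(a)
lower[rows, cols] = a[rows, cols]           # keep only the lower triangle
```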

if invert:
    m = less.outer(arange(N), arange(-k, M-k))
else:
    m = greater_equal.outer(arange(N), arange(-k, M-k))
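A standalone sketch of the outer-comparison trick in the diff above (with made-up N, M, k values):

```python
import numpy as np

# The trick behind the new np.tri: compare row indices against shifted
# column indices in a single outer ufunc call, producing the boolean mask
# directly. mask[i, j] is True exactly where i >= j - k, i.e. on and
# below the k-th diagonal.
N, M, k = 4, 5, 1
mask = np.greater_equal.outer(np.arange(N), np.arange(-k, M - k))
```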
Contributor

Premature optimization alert!
I wonder if this would be faster with float32 arrays instead of integers: the float boolean operations are vectorized while the integer ones are not (yet). It would probably need range checks to avoid rounding issues.

Member Author

Yes, it would be almost 2x faster:

In [180]: %timeit np.less.outer(np.arange(1000), np.arange(1000))
1000 loops, best of 3: 1.19 ms per loop

In [181]: %timeit np.less.outer(np.arange(1000, dtype=np.float32), np.arange(1000, dtype=np.float32))
1000 loops, best of 3: 676 µs per loop

Contributor

I guess I need to get to vectorizing integers :)

Member Author

That would be nice to have for sure. But as you mentioned elsewhere this is already on the verge of optimization for the sake of optimization. Using float32 aranges to speed the comparison up will push it all the way to "root of all evil" in Knuth's words.

@juliantaylor
Contributor

Looks interesting. The changes seem simple enough to be added even without evidence that we need them, but I am curious whether these functions really are bottlenecks in real applications?

@jaimefrio
Member Author

I looked into this after seeing this question on Stack Overflow. It is a little appalling that the indexing turns out to be the slowest operation in that type of calculation.

@jaimefrio
Member Author

There are changes to the public API involved. Nothing too big, but they are there. Should I try to get some feedback from the main list? If yes, I would rather wait until the flames of the @ proposal die down, or no one is going to even look at it.

@@ -757,17 +773,24 @@ def mask_indices(n, mask_func, k=0):
return where(a != 0)


def tril_indices(n, k=0):
def tril_indices(n, k=0, m=None):
"""
Return the indices for the lower-triangle of an (n, n) array.
Contributor

this is now an (n, m) array I assume

Member Author

Oops, sloppy me. Same fix needed for triu_indices as well.

@charris
Member

charris commented Mar 24, 2014

@jaimefrio Now might be a good time.

@juliantaylor
Contributor

He already sent a message to the list.
I think this is fine, though a new invert keyword argument just for a probably unimportant performance gain is a little overkill.
How does creating a boolean tri and then using multiply with an output type to create triu perform?
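That suggestion could look roughly like this (a sketch of the idea, not the code that was ultimately merged):

```python
import numpy as np

# Build a boolean mask once with np.tri, then let np.multiply cast it to
# the input dtype while zeroing out the unwanted half in a single pass.
a = np.random.rand(6, 6)
mask = ~np.tri(*a.shape, k=-1, dtype=bool)  # True on and above the diagonal
upper = np.multiply(a, mask)                # equivalent to np.triu(a)
```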

@jaimefrio
Member Author

Ah, good idea! Letting multiply do the casting makes things about 30% faster overall. I still don't like having an unexpected performance difference between np.triu and np.tril, but I will have to learn to live with it:

In [1]: import numpy as np

In [2]: a = np.random.rand(1000, 1000)

In [3]: %timeit np.tril(a)
100 loops, best of 3: 6.62 ms per loop

In [4]: %timeit np.triu(a)
100 loops, best of 3: 7.94 ms per loop

In [9]: %timeit np.triu_indices_from(a)
100 loops, best of 3: 11.4 ms per loop

In [10]: %timeit np.tril_indices_from(a)
100 loops, best of 3: 10.4 ms per loop

@juliantaylor
Contributor

The difference is only 10% on my machine; it does not seem relevant enough for new API.
The improvement looks nice; merging tomorrow if no one else objects.

btw. in this branch I have integer compare vectorization: https://github.com/juliantaylor/numpy/tree/int-vectorize-compiler. It gains about 20% if the arange fits into a short, even though the compiler vectorization is far from optimal. Might be worth revisiting this if we merge that branch.

@juliantaylor
Contributor

At least on Linux, creating an array with zeros and then using copyto(where=tri) is even faster, and it has the advantage that the matrix is partially sparse if it is really large (rows > page size).
It needs some extra logic to support subtypes, though, as zeros_like does not use calloc yet (due to subtypes).
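The copyto idea could look roughly like this (a sketch, not the code that was merged):

```python
import numpy as np

# copyto variant: start from zeros and copy only inside the triangle.
# If the zeros come from calloc, pages outside the triangle that are
# never written can stay untouched (the "partially sparse" advantage).
a = np.random.rand(6, 6)
mask = np.tri(*a.shape, dtype=bool)   # lower-triangular boolean mask
out = np.zeros_like(a)
np.copyto(out, a, where=mask)
```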

@jaimefrio
Member Author

I did look into something along the lines of what you propose. Something like the following:

mask = np.tri(*m.shape[-2:], k=k, dtype=bool)
tri_lower = np.zeros_like(m)
tri_lower[mask] = m[mask]

was faster when m is 2D. But these functions support stacks of 2D arrays, which means you have to do:

tri_lower[..., mask] = m[..., mask]

and that killed performance big time.

Not sure if using copyto makes it possible to avoid this. Or maybe we should special-case 2D arrays to use boolean mask assignment and keep higher dimensions doing multiplication.

@juliantaylor
Contributor

Hm, why would that kill performance?

While the sparseness would be nice, the inputs would need to be pretty big for it to actually kick in, and the behavior would be inconsistent between platforms anyway, so this is good enough. Thanks, merging.

juliantaylor added a commit that referenced this pull request Mar 26, 2014
ENH: speed-up of triangular matrix functions
@juliantaylor merged commit 7395900 into numpy:master Mar 26, 2014
@jaimefrio
Member Author

Not sure why, but it is easy to test:

In [3]: tri = np.random.randint(2, size=(1000, 1000)).astype(np.bool)

In [4]: a = np.random.rand(1000, 1000)

In [5]: %timeit a[tri] = 0
100 loops, best of 3: 6.97 ms per loop

In [6]: %timeit a[..., tri] = 0
100 loops, best of 3: 18 ms per loop

@jaimefrio deleted the twodim-speedup branch Mar 26, 2014