
ENH: speed-up of triangular matrix functions #4509

Merged
1 commit merged into numpy:master on Mar 26, 2014

Conversation

jaimefrio
Member

  • np.tri now produces fewer intermediate arrays. Runs about 40% faster for
    general dtypes, up to 3x faster for boolean arrays.
  • np.tril now does smarter type conversions (thanks Julian!), and together
    with the improvements in np.tri now runs about 30% faster. np.triu
    runs almost 2x faster than before, but still runs 20% slower than
    np.tril, which is an improvement over the 50% difference before.
  • np.triu_indices and np.tril_indices do not call np.mask_indices,
    instead they call np.where directly on a boolean array created with
    np.tri. They now run roughly 2x faster.
  • Removed the constraint for the array to be square in calls to
    np.triu_indices, np.tril_indices, np.triu_indices_from and
    np.tril_indices_from.
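The last bullet relaxes the square constraint; a minimal usage sketch (assuming the new optional `m` column-count parameter that this PR adds to `np.tril_indices`, visible in the diff below):

```python
import numpy as np

# Rectangular support added in this PR: tril_indices now takes an optional
# column count m (previously the indexed array had to be square).
rows, cols = np.tril_indices(3, k=0, m=5)   # lower triangle of a 3x5 array

a = np.arange(15).reshape(3, 5)
lower = np.zeros_like(a)
lower[rows, cols] = a[rows, cols]           # keep only the lower triangle
```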

if invert:
    m = less.outer(arange(N), arange(-k, M-k))
else:
    m = greater_equal.outer(arange(N), arange(-k, M-k))
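A standalone sketch of the outer-comparison trick in the diff above (with made-up N, M, k values):

```python
import numpy as np

# The trick behind the new np.tri: compare row indices against shifted
# column indices in a single outer ufunc call, producing the boolean mask
# directly. mask[i, j] is True exactly where i >= j - k, i.e. on and
# below the k-th diagonal.
N, M, k = 4, 5, 1
mask = np.greater_equal.outer(np.arange(N), np.arange(-k, M - k))
```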
Contributor

Premature optimization alert!
I wonder if this would be faster with float32 arrays instead of integers: the float boolean operations are vectorized while the integer ones are not (yet). It would probably need range checks to avoid rounding issues.

Member Author

Yes, it would be almost 2x faster:

In [180]: %timeit np.less.outer(np.arange(1000), np.arange(1000))
1000 loops, best of 3: 1.19 ms per loop

In [181]: %timeit np.less.outer(np.arange(1000, dtype=np.float32), np.arange(1000, dtype=np.float32))
1000 loops, best of 3: 676 µs per loop

Contributor

I guess I need to get to vectorizing integers :)

Member Author

That would be nice to have for sure. But as you mentioned elsewhere this is already on the verge of optimization for the sake of optimization. Using float32 aranges to speed the comparison up will push it all the way to "root of all evil" in Knuth's words.

@juliantaylor
Contributor

Looks interesting. The changes seem simple enough to be added even without evidence that we need them, but I am curious whether these functions really are bottlenecks in real applications?

@jaimefrio
Member Author

I looked into this after seeing this question on Stack Overflow. It is a little appalling that the indexing turns out to be the slowest operation in that type of calculation.

@jaimefrio
Member Author

There are changes to the public API involved. Nothing too big, but they are there. Should I try to get some feedback from the main list? If yes, I would rather wait until the flames of the @ proposal die down, or no one is going to even look at it.

@@ -757,17 +773,24 @@ def mask_indices(n, mask_func, k=0):
return where(a != 0)


def tril_indices(n, k=0):
def tril_indices(n, k=0, m=None):
"""
Return the indices for the lower-triangle of an (n, n) array.
Contributor

this is now an (n, m) array I assume

Member Author

Oops, sloppy me. Same fix needed for triu_indices as well.

@charris
Member

charris commented Mar 24, 2014

@jaimefrio Now might be a good time.

@juliantaylor
Contributor

He already sent a message to the list.
I think this is fine, though a new invert keyword argument just for a probably unimportant performance gain is a little overkill.
How does creating a boolean tri and then using multiply with an output type to create triu perform?
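That suggestion could look roughly like this (a sketch of the idea, not the code that was ultimately merged):

```python
import numpy as np

# Build a boolean mask once with np.tri, then let np.multiply cast it to
# the input dtype while zeroing out the unwanted half in a single pass.
a = np.random.rand(6, 6)
mask = ~np.tri(*a.shape, k=-1, dtype=bool)  # True on and above the diagonal
upper = np.multiply(a, mask)                # equivalent to np.triu(a)
```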

@jaimefrio
Member Author

Ah, good idea! Letting multiply do the casting makes things about 30% faster overall. I still don't like having an unexpected performance difference between np.triu and np.tril, but I will have to learn to live with it:

In [1]: import numpy as np

In [2]: a = np.random.rand(1000, 1000)

In [3]: %timeit np.tril(a)
100 loops, best of 3: 6.62 ms per loop

In [4]: %timeit np.triu(a)
100 loops, best of 3: 7.94 ms per loop

In [9]: %timeit np.triu_indices_from(a)
100 loops, best of 3: 11.4 ms per loop

In [10]: %timeit np.tril_indices_from(a)
100 loops, best of 3: 10.4 ms per loop

@juliantaylor
Contributor

The difference is only 10% on my machine; it does not seem relevant enough for new API.
The improvement looks nice; merging tomorrow if no one else objects.

btw. in this branch I have integer compare vectorization: https://github.com/juliantaylor/numpy/tree/int-vectorize-compiler. It gains about 20% if the arange fits into a short, even though the compiler vectorization is far from optimal. Might be worth revisiting this if we merge that branch.

@juliantaylor
Contributor

At least on Linux, creating an array with zeros and then using copyto(where=tri) is even faster, and it has the advantage that the matrix is partially sparse if it is really large (rows > page size).
It needs some extra logic to support subtypes, though, as zeros_like does not use calloc yet (due to subtypes).
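The copyto idea could look roughly like this (a sketch, not the code that was merged):

```python
import numpy as np

# copyto variant: start from zeros and copy only inside the triangle.
# If the zeros come from calloc, pages outside the triangle that are
# never written can stay untouched (the "partially sparse" advantage).
a = np.random.rand(6, 6)
mask = np.tri(*a.shape, dtype=bool)   # lower-triangular boolean mask
out = np.zeros_like(a)
np.copyto(out, a, where=mask)
```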

@jaimefrio
Member Author

I did look into something along the lines of what you propose. Something like the following:

mask = np.tri(*m.shape[-2:], k=k, dtype=bool)
tri_lower = np.zeros_like(m)
tri_lower[mask] = m[mask]

was faster when m is 2D. But these functions support stacks of 2D arrays, which means you have to do:

tri_lower[..., mask] = m[..., mask]

and that killed performance big time.

Not sure if using copyto makes it possible to avoid this. Or maybe we should special-case 2D arrays to use boolean mask assignment and keep higher dimensions doing multiplication.

@juliantaylor
Contributor

Hm, why would that kill performance?

While the sparseness would be nice, the inputs would need to be pretty big for it to actually kick in, and the behavior would be inconsistent between platforms anyway, so this is good enough. Thanks, merging.

juliantaylor added a commit that referenced this pull request Mar 26, 2014
ENH: speed-up of triangular matrix functions
@juliantaylor merged commit 7395900 into numpy:master Mar 26, 2014
@jaimefrio
Member Author

Not sure why, but it is easy to test:

In [3]: tri = np.random.randint(2, size=(1000, 1000)).astype(np.bool)

In [4]: a = np.random.rand(1000, 1000)

In [5]: %timeit a[tri] = 0
100 loops, best of 3: 6.97 ms per loop

In [6]: %timeit a[..., tri] = 0
100 loops, best of 3: 18 ms per loop

@jaimefrio deleted the twodim-speedup branch Mar 26, 2014