
NumPy ndarray expression with broadcasting is slower when not using a local variable. #6387

Open
ruoyu0088 opened this issue Oct 17, 2020 · 5 comments
Labels
numpy performance - run time Performance issue occurring at run time.

Comments

@ruoyu0088

Numba 0.51.2

Here is the code; f1() is 24x faster than f2().

import numpy as np
from numba import njit
x = np.random.randn(1, 1000)
y = np.random.randn(100, 1)

@njit
def f1(x, y):
    a = np.sin(x**2)
    b = np.cos(y**2)
    return a + b

@njit
def f2(x, y):
    return np.sin(x**2) + np.cos(y**2)

f1(x, y)
f2(x, y)

%timeit f1(x, y)
%timeit f2(x, y)

and here is the result:

75.4 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.81 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@esc esc added needtriage performance - run time Performance issue occurring at run time. labels Oct 19, 2020
@esc
Member

esc commented Oct 19, 2020

@ruoyu0088 thanks for submitting this to the Numba issue tracker. I can confirm I am able to reproduce the timings on my local machine.

In [2]: %timeit f1(x,y)
101 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit f2(x,y)
1.55 ms ± 31.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@esc esc added numpy and removed needtriage labels Oct 20, 2020
@esc
Member

esc commented Oct 20, 2020

@ruoyu0088 so, this is a very interesting little issue you stumbled upon. We discussed this in the triage meeting yesterday and came to the following conclusions:

This is probably the result of how the array expressions are handled. In the fast example, broadcasting happens only at the very end, whereas in the slow example broadcasting happens as part of the main loop, so more work is done. So even though the first version creates additional arrays (a and b) in memory, it is faster overall.
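
To put rough numbers on that, here is a back-of-the-envelope count of the transcendental calls (a sketch based on the shapes in the OP, not actual Numba output):

```python
import numpy as np

x = np.random.randn(1, 1000)
y = np.random.randn(100, 1)

# Fast strategy (f1): sin/cos run on the small, un-broadcast arrays,
# and only the final add works over the full (100, 1000) result.
calls_fast = x.size + y.size                  # 1000 + 100 = 1100 sin/cos calls

# Slow strategy (f2): the fused expression iterates over the broadcast
# shape, so sin and cos are each evaluated once per output element.
out_shape = np.broadcast(x, y).shape          # (100, 1000)
calls_slow = 2 * out_shape[0] * out_shape[1]  # 200000 sin/cos calls

print(calls_fast, calls_slow)                 # 1100 200000
```

That is roughly a 180x difference in transcendental work, which is in the same ballpark as the observed 24x slowdown once the cost of the extra temporaries in f1 is accounted for.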

Below you will find the control flow graphs for the two functions:

SVGOUTPUT1.pdf
SVGOUTPUT2.pdf

As you can see, the IR for the slow example is smaller, because it contains fewer loops and the like.

Lastly, I also looked at how numexpr (an alternative array-expression compiler) fares in this case:

In [1]: import numpy as np

In [2]: import numexpr as ne

In [3]: x = np.random.randn(1, 1000)

In [4]: y = np.random.randn(100, 1)

In [5]: %timeit ne.evaluate("sin(x**2)") + ne.evaluate("cos(y**2)")
85.8 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit ne.evaluate("sin(x**2) + cos(y**2)")
196 µs ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As you can see, the slower example is also slower here, but only by about 2x, not 24x.

Numba could probably be improved by generating explicit loops for array expressions instead of ufuncs. This would give LLVM more optimization opportunities, and the assumption is that Numba would perform better in this case. However, it appears this will require a significant change to Numba's internals and to how array expressions are compiled.
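
As a hand-written sketch of what such loop-based code generation could enable (this is illustrative, not what Numba emits today): with explicit loops, the terms that don't depend on the inner index can be hoisted, recovering most of f1's advantage without materializing full-size temporaries.

```python
import numpy as np

def f2_loops(x, y):
    # Hand-written sketch: broadcast via explicit loops, hoisting the
    # loop-invariant pieces. LLVM could in principle perform this kind
    # of hoisting if the expression were lowered to loops like these.
    n_rows, n_cols = y.shape[0], x.shape[1]
    out = np.empty((n_rows, n_cols))
    sx = np.sin(x[0] ** 2)            # depends only on the column: computed once
    for i in range(n_rows):
        cy = np.cos(y[i, 0] ** 2)     # depends only on the row: once per row
        for j in range(n_cols):
            out[i, j] = sx[j] + cy
    return out
```

With this structure, sin and cos are each evaluated only x.size + y.size times, as in f1, while the only full-size array allocated is the output itself.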

@ruoyu0088
Author

@esc Thanks for the reply. Does Numba generate loops similar to the following code, so that it doesn't create any intermediate arrays?

def f2(x, y):
    # create an output array with the broadcast shape
    out = np.empty(np.broadcast(x, y).shape)
    it = np.nditer([x, y, out], op_flags=[["readonly"], ["readonly"], ["writeonly"]])
    for x_, y_, out_ in it:
        out_[...] = np.sin(x_**2) + np.cos(y_**2)
    return out

According to the numexpr documentation, it splits arrays into small chunks, so it is something between f1() and f2(): it creates small intermediate arrays and generates loops to do the calculation.
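
A rough sketch of that chunked strategy (assuming a simple row-blocked split; numexpr's actual virtual machine is more involved):

```python
import numpy as np

def f2_chunked(x, y, block=16):
    # numexpr-style sketch: evaluate the broadcast expression in small
    # row blocks, so the intermediates stay small (roughly cache-sized)
    # while the work is still done in vectorized chunks.
    out = np.empty((y.shape[0], x.shape[1]))
    for start in range(0, y.shape[0], block):
        yb = y[start:start + block]
        # Note: sin(x**2) is recomputed for every block here, which is
        # consistent with the fused numexpr expression still being ~2x
        # slower than evaluating the two halves separately.
        out[start:start + block] = np.sin(x ** 2) + np.cos(yb ** 2)
    return out
```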

@esc
Member

esc commented Oct 21, 2020

@ruoyu0088 from what I understand, I think that is correct, in the sense that Numba tries to avoid generating temporaries, but I'm not too well versed in that part of Numba yet, so perhaps someone else can give you a more definitive answer. If you want to know for sure, I would suggest using inspect_llvm to look at the LLVM IR that Numba generates for the two variants of f2 that you have and seeing how they compare.

@stuartarchibald
Contributor

@ruoyu0088 in this case, the function f2 defined above is approximately what Numba does for the "slow" function in the OP.
