
NumPy ndarray expression with broadcasting is slower when not using a local variable. #6387

Open
ruoyu0088 opened this issue Oct 17, 2020 · 5 comments
Labels
numpy performance - run time Performance issue occurring at run time.

Comments

@ruoyu0088

Numba 0.51.2

Here is the code; f1() is 24x faster than f2().

import numpy as np
from numba import njit
x = np.random.randn(1, 1000)
y = np.random.randn(100, 1)

@njit
def f1(x, y):
    a = np.sin(x**2)
    b = np.cos(y**2)
    return a + b

@njit
def f2(x, y):
    return np.sin(x**2) + np.cos(y**2)

f1(x, y)
f2(x, y)

%timeit f1(x, y)
%timeit f2(x, y)

and here is the result:

75.4 µs ± 2.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.81 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
@esc esc added needtriage performance - run time Performance issue occurring at run time. labels Oct 19, 2020
@esc
Member

esc commented Oct 19, 2020

@ruoyu0088 thanks for submitting this to the Numba issue tracker. I can confirm I am able to reproduce the timings on my local machine.

In [2]: %timeit f1(x,y)
101 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [3]: %timeit f2(x,y)
1.55 ms ± 31.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@esc esc added numpy and removed needtriage labels Oct 20, 2020
@esc
Member

esc commented Oct 20, 2020

@ruoyu0088 so, this is a very interesting little issue you stumbled upon. We discussed this in the triage meeting yesterday and came to the following conclusions:

This is probably the result of how the array expressions are handled. In the fast example, broadcasting happens only at the very end, whereas in the slow example broadcasting happens as part of the main loop, so more work is done. So even though the first version creates additional arrays (a and b) in memory, it is faster overall.
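
To put rough numbers on that, here is a back-of-the-envelope count of the transcendental calls (a sketch based on the shapes in the OP, not actual Numba output):

```python
import numpy as np

x = np.random.randn(1, 1000)
y = np.random.randn(100, 1)

# Fast strategy (f1): sin/cos run on the small, un-broadcast arrays,
# and only the final add works over the full (100, 1000) result.
calls_fast = x.size + y.size                  # 1000 + 100 = 1100 sin/cos calls

# Slow strategy (f2): the fused expression iterates over the broadcast
# shape, so sin and cos are each evaluated once per output element.
out_shape = np.broadcast(x, y).shape          # (100, 1000)
calls_slow = 2 * out_shape[0] * out_shape[1]  # 200000 sin/cos calls

print(calls_fast, calls_slow)                 # 1100 200000
```

That is roughly a 180x difference in transcendental work, which is in the same ballpark as the observed 24x slowdown once the cost of the extra temporaries in f1 is accounted for.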

Below you will find the control flow graphs for the two functions:

SVGOUTPUT1.pdf
SVGOUTPUT2.pdf

As you can see, the IR for the slow example is smaller, because it contains fewer loops and the like.

Lastly, I also looked at how numexpr (an alternative array-expression compiler) fares in this case:

In [1]: import numpy as np

In [2]: import numexpr as ne

In [3]: x = np.random.randn(1, 1000)

In [4]: y = np.random.randn(100, 1)

In [5]: %timeit ne.evaluate("sin(x**2)") + ne.evaluate("cos(y**2)")
85.8 µs ± 7.84 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [6]: %timeit ne.evaluate("sin(x**2) + cos(y**2)")
196 µs ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

As you can see, the slower example is also slower here, but only by about 2x, not 24x.

Numba could probably be improved by generating explicit loops for array expressions instead of ufuncs. This would give LLVM more optimization opportunities, and the assumption is that Numba would perform better in this case. However, it appears this will require a significant change to Numba's internals and to how array expressions are compiled.
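
As a hand-written sketch of what such loop-based code generation could enable (this is illustrative, not what Numba emits today): with explicit loops, the terms that don't depend on the inner index can be hoisted, recovering most of f1's advantage without materializing full-size temporaries.

```python
import numpy as np

def f2_loops(x, y):
    # Hand-written sketch: broadcast via explicit loops, hoisting the
    # loop-invariant pieces. LLVM could in principle perform this kind
    # of hoisting if the expression were lowered to loops like these.
    n_rows, n_cols = y.shape[0], x.shape[1]
    out = np.empty((n_rows, n_cols))
    sx = np.sin(x[0] ** 2)            # depends only on the column: computed once
    for i in range(n_rows):
        cy = np.cos(y[i, 0] ** 2)     # depends only on the row: once per row
        for j in range(n_cols):
            out[i, j] = sx[j] + cy
    return out
```

With this structure, sin and cos are each evaluated only x.size + y.size times, as in f1, while the only full-size array allocated is the output itself.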

@ruoyu0088
Author

@esc Thanks for the reply. Does Numba generate loops similar to the following code, so that it doesn't create any intermediate arrays?

def f2(x, y):
    # create an output array with the broadcast shape
    out = np.empty(np.broadcast(x, y).shape)
    it = np.nditer([x, y, out], op_flags=[["readonly"], ["readonly"], ["writeonly"]])
    for x_, y_, out_ in it:
        out_[...] = np.sin(x_**2) + np.cos(y_**2)
    return out

According to the numexpr documentation, it splits arrays into small chunks, so it is something between f1() and f2(): it creates small intermediate arrays and generates loops to do the calculation.
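
A rough sketch of that chunked strategy (assuming a simple row-blocked split; numexpr's actual virtual machine is more involved):

```python
import numpy as np

def f2_chunked(x, y, block=16):
    # numexpr-style sketch: evaluate the broadcast expression in small
    # row blocks, so the intermediates stay small (roughly cache-sized)
    # while the work is still done in vectorized chunks.
    out = np.empty((y.shape[0], x.shape[1]))
    for start in range(0, y.shape[0], block):
        yb = y[start:start + block]
        # Note: sin(x**2) is recomputed for every block here, which is
        # consistent with the fused numexpr expression still being ~2x
        # slower than evaluating the two halves separately.
        out[start:start + block] = np.sin(x ** 2) + np.cos(yb ** 2)
    return out
```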

@esc
Member

esc commented Oct 21, 2020

@ruoyu0088 from what I understand, I think that is correct, in the sense that Numba tries to avoid generating temporaries, but I'm not too well versed in that part of Numba yet, so perhaps someone else can give you a more definitive answer. If you want to know for sure, I would suggest using inspect_llvm to look at the LLVM IR that Numba generates for the two variants of f2 that you have and seeing how they compare.

@stuartarchibald
Contributor

@ruoyu0088 in this case, the function f2 defined above is approximately what Numba does for the "slow" function in the OP.
