NumPy ndarray expression with broadcast is slower when not use local variable. #6387
Comments
@ruoyu0088 thanks for submitting this to the Numba issue tracker. I can confirm I am able to reproduce the timings on my local machine.
@ruoyu0088 so, this is a very interesting little issue you stumbled upon. We discussed this in the triage meeting yesterday and came to the following conclusions: This is probably the result of how the array expressions are handled. In the fast example, broadcasting happens at the very end, whereas in the slow example it happens as part of the main loop, so more work is done. So, even though the first one probably creates additional arrays (a and b) in memory, it is faster overall. Below you will find the control flow graphs for the two functions: As you can see, the IR for the slow example is smaller, because there are fewer loops and such. Lastly, I also looked at how
As you can see, the slower example is also slower here, but only by 2x, not 24x. Numba could probably be optimized if, instead of generating ufuncs for array expressions, we generated loops. This would give LLVM more optimization opportunities, and the assumption is that Numba may perform better in this case. However, it appears that this will require a significant change to the Numba internals and to how array expressions are compiled.
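To make the "more work is done" argument concrete, here is a minimal pure-Python sketch. It is not the code from the issue (which is not shown here) and not what Numba actually generates. It assumes the expression is `sin(x**2) + cos(y**2)`, as in the follow-up comment, with `x` acting as a column and `y` as a row. The names `unfused` and `fused` are hypothetical. The sketch only counts how often `sin` and `cos` are evaluated in each strategy.

```python
import math


def unfused(xs, ys):
    """Evaluate each operand at its own size; broadcast only in the final add."""
    calls = 0
    a = []
    for x in xs:  # temporary "a", one entry per element of x
        a.append(math.sin(x * x))
        calls += 1
    b = []
    for y in ys:  # temporary "b", one entry per element of y
        b.append(math.cos(y * y))
        calls += 1
    # Broadcasting happens here: only cheap additions run n * m times.
    out = [[ai + bj for bj in b] for ai in a]
    return out, calls


def fused(xs, ys):
    """Evaluate the whole expression once per element of the broadcast result."""
    calls = 0
    out = []
    for x in xs:
        row = []
        for y in ys:
            # sin and cos are recomputed for every (x, y) pair.
            row.append(math.sin(x * x) + math.cos(y * y))
            calls += 2
        out.append(row)
    return out, calls


xs = [0.1 * i for i in range(4)]
ys = [0.2 * j for j in range(3)]
out_u, calls_u = unfused(xs, ys)
out_f, calls_f = fused(xs, ys)
print(calls_u, calls_f)  # n + m calls versus 2 * n * m calls
```

For inputs of length n and m, the unfused version makes n + m transcendental calls and the fused version makes 2 * n * m, which is one plausible reason a single fused loop can lose despite allocating no temporaries.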
@esc Thanks for the reply. Does Numba generate loops similar to the following code, so that it doesn't create any intermediate arrays?

    def f2(x, y):
        out = ... # create an array with broadcasted shape
        for x_, y_, out_ in np.nditer((x, y, out)):
            r = np.sin(x_**2) + np.cos(y_**2)
            out_.itemset(r)
        return out

According to the numexpr documentation, it splits arrays into small chunks, so it is something between f1() and f2(): it creates small intermediate arrays and generates loops to do the calculation.
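For reference, the element-wise loop described above can be written as runnable NumPy code. This is a sketch under assumptions, not Numba's generated code: the name `f2_nditer` is hypothetical, the output dtype is assumed to be `float64`, and the iterator is `np.nditer` with an explicit write-only output operand.

```python
import numpy as np


def f2_nditer(x, y):
    # Allocate the result with the broadcast shape of the two inputs.
    out = np.empty(np.broadcast(x, y).shape, dtype=np.float64)
    it = np.nditer(
        [x, y, out],
        op_flags=[["readonly"], ["readonly"], ["writeonly"]],
    )
    with it:
        for x_, y_, out_ in it:
            # One scalar evaluation per output element, no temporary arrays.
            out_[...] = np.sin(x_ ** 2) + np.cos(y_ ** 2)
    return out


x = np.linspace(0.0, 1.0, 4).reshape(-1, 1)  # column vector
y = np.linspace(0.0, 2.0, 3).reshape(1, -1)  # row vector
result = f2_nditer(x, y)
print(result.shape)
```

In plain Python this loop is far slower than the vectorized expression; it is only meant to show the iteration pattern, in which `sin` and `cos` are evaluated once per output element.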
@ruoyu0088 from what I understand, I think that is correct, in the sense that Numba tries to avoid generating temporaries, but I'm really not too well versed in that part of Numba yet, so perhaps someone else could give you a more definitive answer. If you want to know for sure, I would suggest using
@ruoyu0088 in this case, the function
Numba 0.51.2
Here is the code; f1() is 24x faster than f2().
And here is the result: