Feature request: optimize permutable nested prange #2960
Comments
Thanks for the request. Extracting the code from https://stackoverflow.com/questions/50255126/numba-doesnt-parallelize-range below:

```python
from numba import njit, prange
from time import time

@njit
def f1(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

@njit
def f2(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

@njit(parallel=True)
def f3(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

@njit(parallel=True)
def f4(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

for f in [f1, f2, f3, f4]:
    d = time()
    f(2500)
    print('%.02f' % (time() - d))
```

As Numba uses JIT compilation, the timing part of the code above includes compilation time as well as execution time in the reported figure. For reference, the above gives me:
Editing the timing part of the code so that it times just execution:

```python
for f in [f1, f2, f3, f4]:
    f(10)  # invoke once with the type used in the timed section to trigger compilation
    d = time()
    f(2500)
    print('%.02f' % (time() - d))
```

gives:
It doesn't make much difference, as the compute part of the code is quite heavy for the given input.

As to the four variants of the loop and what is observed in the timings:

`f1` (`@njit`, plain `range` loops):

```python
@njit
def f1(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s
```

`f2` (`@njit`, `prange` loops, no `parallel=True`):

```python
@njit
def f2(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s
```

`f3` (`@njit(parallel=True)`, plain `range` loops):

```python
@njit(parallel=True)
def f3(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s
```

`f4` (`@njit(parallel=True)`, `prange` loops):

```python
@njit(parallel=True)
def f4(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s
```
In answer to your feature request: at present Numba compiles code based on the type of the arguments and not their values; it also compiles everything to machine code upfront and dispatches to compiled code based purely on type. The behaviour described in the feature request requires analysis based on run-time values, and so is more amenable to a tracing JIT, which could feasibly analyse a loop nest instance at run time and perform dynamic loop nest optimisations. However, this is out of scope for what Numba can do at present; I would think #2949 will help towards being able to achieve this, though. All this said, I think you will get loop nest optimisation from the LLVM backend via e.g. loop unswitching if you use loop bounds that are fixed at compile time, e.g. if you declare your loops as having a fixed size.
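As a sketch of that last suggestion (the constant `N` and the function name `f_fixed` are illustrative, not from the thread): binding the loop bound as a module-level constant, rather than passing it as an argument, means Numba freezes its value at compile time, giving LLVM fixed trip counts to optimise against. The `try`/`except` fallback just lets the sketch run where Numba isn't installed.

```python
try:
    from numba import njit
except ImportError:  # fall back to plain Python so the sketch still runs
    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f

N = 64  # loop bound frozen at compile time (illustrative value)

@njit
def f_fixed():
    # N is a global, so Numba bakes its value into the compiled code
    # rather than reading it from an argument at run time.
    s = 0
    for i in range(N):
        for j in range(N):
            for k in range(N):
                s += (i * k < j * j) - (i * k > j * j)
    return s
```

The trade-off is that changing `N` after compilation has no effect on the already-compiled function.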
Hi, thank you for this very precise answer. Do you have any hints about the loop order that fits best? I think the best performance is achieved when the outer loop is bigger, but in a real-world example, i, j and k could also be array indexes and SIMD would eventually be used; so I think the order of the innermost loops counts as well. Another temporary solution would be to write the code multiple times with different orders and select heuristically the best one. It could even be done programmatically by pure Python code that generates the different Numba functions and chooses which one is called each time. In my real-world example, changing the loop order reduced my running time by more than 30% (and it wasn't even the worst possible case).
For an arbitrary loop nest with a trivial, largely memory-access-free body in which loops can be interchanged, there's unlikely to be an advantage to parallelizing one outer loop over another. What will matter in practice will be your data access patterns and instruction selection, as it is unlikely that the loop body is trivial as per the examples. The loop order can in part be guided by prior knowledge of the hardware and data layout, but should also be informed by profiling and testing. It is probably worth reading the performance tips page, and, if you feel so inclined, looking at the
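To illustrate the data-access-pattern point with a hypothetical example (the function names and array are mine, not from the thread): for a C-ordered (row-major) NumPy array, making the row index the outer loop walks memory contiguously, while the transposed order strides across rows. Both orders compute the same result; only the memory traffic differs.

```python
import numpy as np

def sum_row_major(a):
    # i outer, j inner: consecutive iterations touch adjacent memory
    # for a C-ordered array, which is cache-friendly.
    s = 0
    rows, cols = a.shape
    for i in range(rows):
        for j in range(cols):
            s += a[i, j]
    return s

def sum_col_major(a):
    # j outer, i inner: each step jumps a whole row ahead in memory,
    # which thrashes the cache for large C-ordered arrays.
    s = 0
    rows, cols = a.shape
    for j in range(cols):
        for i in range(rows):
            s += a[i, j]
    return s
```

For large arrays, timing the two with `timeit` typically shows the row-major order ahead, which is the kind of measurement the comment above recommends over guessing.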
I see, thank you very much! After reflection, I think that the problem (optimizing the permutation of a product of ranges) is too complicated to be solved by LLVM, and I don't think #2949 will help here, because we are talking about very different range values, not just a few special cases. The only practical solution would be to automate the process of generating multiple loop orders and to have a heuristic to choose which one should be used. The only problem is that this heuristic should ideally depend on the LLVM IR generated.
This raises an interesting question of how to do effective metaprogramming in Python. Ideally, you would want a factory function that could generate code with different loop permutations, benchmark each, and then pick the best one for the rest of the application run. (I've used this technique to pick CUDA thread grid configurations with great success.) Unfortunately, there isn't an easy way to do this kind of code generation in Python (probably not in most languages). There's no macro system, so much of the metaprogramming that Numba does internally is achieved by constructing strings of Python source code.
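A minimal sketch of that string-construction approach, applied to the loop nest from this thread (the template and names are illustrative): generate one function per loop permutation with `exec`, then a driver can time each candidate and keep the fastest. Adding `@njit` to the generated source would work the same way; it is omitted here so the sketch runs without Numba.

```python
from itertools import permutations

# Source template for the triple loop; {a}/{b}/{c} are filled with a
# permutation of the loop variables i, j, k.
TEMPLATE = """
def {name}(n):
    s = 0
    for {a} in range(n):
        for {b} in range(n):
            for {c} in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s
"""

def make_variants():
    # Build one function per permutation of the loop variables.
    funcs = {}
    for a, b, c in permutations("ijk"):
        name = f"loop_{a}{b}{c}"
        namespace = {}
        exec(TEMPLATE.format(name=name, a=a, b=b, c=c), namespace)
        funcs[name] = namespace[name]
    return funcs
```

From here a driver could run each variant on a small representative input with `timeit.timeit` and dispatch to the winner for the rest of the run, which is the benchmark-and-pick strategy described above.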
Thank you, I was looking for the word "metaprogramming". I think that just permuting loop products is easy. The idea here would not be to benchmark them but to optimize the iteration order. Don't hesitate to correct me, as I don't know the internals of Numba very well (I'm starting to read the code), but I think it would be possible (although not trivial).
I think the LLVM IR generated will essentially be invariant to loop permutation if the inner loop body is trivial. The general class of problem presented, optimising loop iteration order for non-trivial cases without benchmarking, is hard, but tools like https://polly.llvm.org/ can help. I've added #2996 as a reminder to consider adding this to the Numba LLVM builds. If your problem is not amenable to text-based generation, either mutation of
Closing this question as it seems to be resolved. Numba now has a discourse forum https://numba.discourse.group/ which is great for questions like this; please do consider posting there in future :) Thanks!
Feature request

I have a function with three nested range loops. I optimize it using `@njit(parallel=True)`. The size of the ranges really depends, and some can be trivial (`range(1)`). Doing some testing, I observed that the only way to parallelize all three is to use `prange` on each.

But now I observe that permuting the three loops changes the performance (my loop sizes are very unequal: one range is ~2000 and the other ~20), which is understandable (I think only the outermost `prange` is used).

What do you recommend? The best solution would be a way for Numba to understand that the loops can be permuted depending on their size / for cache-hit optimization (think of a generalized `prange`). Numba could maybe even "test" the orders and select the best one while running (not sure that makes any sense).

Nested loops have this structure:

If there is no code before or after, or if it is fast, supporting a permutable version of `itertools.product` would be sufficient (like a `pproduct`). Else, one should ensure that the "code before" in `prange(n2)` doesn't use `i1`, by computing a dependency graph or providing a syntax for the user to specify it. For example:
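The example code the request refers to did not survive extraction, so here is only a rough sketch of what a permutable product might look like in pure Python (the name `pproduct` comes from the request; the largest-range-outermost heuristic is my assumption): it reorders the iteration so the longest range runs outermost, while still yielding index tuples in the caller's original positional order.

```python
from itertools import product

def pproduct(*ranges):
    # Iterate with the longest range outermost (a heuristic stand-in for
    # what a real permutable prange might choose), but yield each index
    # tuple in the caller's original positional order.
    order = sorted(range(len(ranges)), key=lambda i: -len(ranges[i]))
    inverse = [order.index(i) for i in range(len(ranges))]
    for idx in product(*(ranges[i] for i in order)):
        yield tuple(idx[inverse[i]] for i in range(len(ranges)))
```

A compiler-level version would additionally have to prove that the loop body is order-independent (the dependency-graph point above) before applying any such permutation.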