
Feature request: optimize permutable nested prange #2960

Closed · louisabraham opened this issue May 12, 2018 · 8 comments

louisabraham commented May 12, 2018

Feature request

I have a function with three nested range loops. I optimize it using @njit(parallel=True).

The sizes of the ranges vary from call to call, and some can be trivial (range(1)).

Doing some testing, I observed that the only way to parallelize all three is to use prange on each.

However, I observe that permuting the three loops changes the performance (my loop sizes are very unequal: one range is ~2000 and another ~20), which is understandable (I think only the outermost prange is used).

What do you recommend? The best solution would be a way for Numba to understand that the loops can be permuted depending on their sizes / for cache-hit optimization (think of a generalized prange). Numba could maybe even "test" the orders at run time and select the best one (not sure whether that makes sense).

Nested loops have this structure:

for i1 in prange(n1):
    # code before
    for i2 in prange(n2):
        # recursive structure
    # code after

If there is no code before or after, or if it is fast, supporting a permutable version of itertools.product (say, a pproduct) would be sufficient:

for i1, i2, i3 in pproduct(range(n1), range(n2), range(n3)):
    # code before 1
    # code before 2
    # code before 3
    # code inside
    # code after 3
    # code after 2
    # code after 1

Otherwise, one should ensure that the "code before" the prange(n2) loop doesn't use i1, either by computing a dependency graph or by providing a syntax for the user to specify it. For example:

l1, l2, l3 = permutable(range(n1), range(n2), range(n3))
for i1 in l1:
    # code before
    for i2 in l2:
        # recursive structure
    # code after
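In the meantime, a pproduct-like behaviour can be emulated by hand by collapsing the nest into a single prange over the flattened index space; permuting the loops then amounts to changing how the flat index is decoded. A rough sketch with a placeholder body (this is not a Numba feature, just an illustration):

from numba import njit, prange

@njit(parallel=True)
def flattened(n1, n2, n3):
    s = 0
    for idx in prange(n1 * n2 * n3):   # one parallel loop over the whole index space
        i1 = idx // (n2 * n3)          # the decoding determines which index varies fastest
        i2 = (idx // n3) % n2
        i3 = idx % n3
        s += i1 * i2 * i3              # placeholder body
    return s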
louisabraham changed the title from "Feature request: optimize _permutable_ nested prange" to "Feature request: optimize permutable nested prange" on May 12, 2018
stuartarchibald (Contributor)

Thanks for the request. Extracting the code from https://stackoverflow.com/questions/50255126/numba-doesnt-parallelize-range below.

from numba import njit, prange
from time import time


@njit
def f1(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s


@njit
def f2(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s


@njit(parallel=True)
def f3(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s


@njit(parallel=True)
def f4(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s


for f in [f1, f2, f3, f4]:
    d = time()
    f(2500)
    print('%.02f' % (time() - d))

As Numba uses JIT compilation, the timing part of the code above includes the compilation time as well as the execution time in the reported figure. For reference, the above gives me:

21.39
21.18
21.20
14.00

Editing the timing part of the code so that it times just the execution:

for f in [f1, f2, f3, f4]:
    f(10) # invoke once with the type used in the timed section to trigger compilation
    d = time()
    f(2500)
    print('%.02f' % (time() - d))

gives:

21.09
20.92
20.92
12.71

It doesn't make much difference here, as the compute part of the code is quite heavy for the given n, but it should always be considered.

As to the four variants of the loop and what is observed in the timings:

@njit
def f1(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

Function f1 is a standard CPU-compiled, triply nested loop.

@njit
def f2(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

Function f2 has loops declared with prange, but no parallel=True option set in njit; as a result, the compiler treats prange as an alias of range.

@njit(parallel=True)
def f3(n):
    s = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

Function f3 has parallel=True, so analysis for transforming the code to execute in parallel will happen; however, the analysis correctly decides that there is nothing to parallelize (no loop was declared as parallel with prange). The example in the parallel documentation that uses range parallelizes the loop body, which contains many computations on arrays; it is these that are fused and transformed into a parallel region. In that example it is also worth noting that w is a loop-carried dependency (iteration i+1 needs the result from iteration i), so an embarrassingly parallel loop execution is not possible.

@njit(parallel=True)
def f4(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            for k in prange(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s

Function f4 has parallel=True and the loops are all declared with prange. This allows analysis for transforming the code to execute in parallel, and since there are explicitly declared parallel loops suitable for the transformation, the transform is applied and the code runs more quickly.

Declaring prange on inner loops when there is an outer prange loop results in the inner ones being run as range loops; this prevents nested parallelism and also means that larger blocks of work are available per thread.
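In other words, f4 effectively executes roughly as if it had been written like this (a sketch for illustration; the name f4_equivalent is not in the original code):

from numba import njit, prange

@njit(parallel=True)
def f4_equivalent(n):
    s = 0
    for i in prange(n):        # only the outermost loop is run in parallel
        for j in range(n):     # inner pranges effectively behave like plain range
            for k in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s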

Information about what parallel=True is doing can be obtained by setting the environment variable NUMBA_DEBUG_ARRAY_OPT_STATS; with this set, the terminal states:

Function f3 has no Parfor.
Parallel for-loop #0 is produced from pattern '('prange', 'user')' at issue2960.py (41)
Parallel for-loop #1 is produced from pattern '('prange', 'user')' at issue2960.py (40)
Parallel for-loop #2 is produced from pattern '('prange', 'user')' at issue2960.py (39)
After fusion, parallel for-loop After fusion has 1 nested Parfor(s) #1.
After fusion, parallel for-loop After fusion has 2 nested Parfor(s) #1.
After fusion, function f4 has 1 parallel for-loop(s) #{2}.

This shows that f3 has no parfor transform, and that f4 has 3 parallel loops identified from prange which are fused into a single loop (loop #2, the outer one).
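The variable is usually set in the shell before running the script; as a small sketch, it can also be set from Python, provided this happens before Numba is imported (assumption: nothing has imported numba earlier in the process):

import os
os.environ["NUMBA_DEBUG_ARRAY_OPT_STATS"] = "1"   # must be set before numba is imported

from numba import njit, prange   # parfor statistics are then printed during compilation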

In answer to your feature request... at present Numba compiles code based on the types of the arguments and not their values; it also compiles everything to machine code up front and dispatches to compiled code based purely on type. The behaviour described in the feature request requires analysis based on run-time values and so is more amenable to a tracing JIT, which could feasibly analyse a loop-nest instance at run time and perform dynamic loop-nest optimisations. However, this is out of the scope of what Numba can do at present; I would think #2949 will help towards being able to achieve this though.

This all said, I think you will get loop-nest optimisation from the LLVM backend via e.g. loop switching if you use loop bounds that are fixed at compile time, e.g. if you declare your loops with a fixed size like range(20).
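As a sketch of that suggestion (same loop body as above, but with literal bounds so the trip counts are known at compile time; the specific sizes here are placeholders):

from numba import njit, prange

@njit(parallel=True)
def kernel_fixed():
    s = 0
    for i in prange(2000):        # literal bound on the parallel loop
        for j in range(20):       # fixed inner bounds give LLVM known trip counts
            for k in range(20):   # and scope for loop-nest optimisation
                s += (i * k < j * j) - (i * k > j * j)
    return s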

louisabraham (Author) commented May 14, 2018

Hi, thank you for this very precise answer.

Do you have any hints about which loop order fits best? I think the best performance is achieved when the outer loop is the biggest, but in a real-world example, i, j and k could also be array indices and SIMD would eventually be used; so I think the order of the innermost loops counts as well.

Another temporary solution would be to write the code multiple times with different orders and heuristically select the best one. It could even be done programmatically with pure Python code that generates the different Numba functions and chooses which one is called each time. In my real-world example, changing the loop order reduced my running time by more than 30% (and it wasn't even the worst possible case).

stuartarchibald (Contributor) commented May 15, 2018

For an arbitrary loop nest with a trivial, largely memory-access-free body in which loops can be interchanged, there is unlikely to be an advantage to parallelizing one outer loop over another.

What will matter in practice are your data access patterns and instruction selection, as it is unlikely that a real loop body is as trivial as in the examples. The loop order can in part be guided by prior knowledge of the hardware and data layout, but should also be informed by profiling and testing. It is probably worth reading the performance tips page, and, if you feel so inclined, looking at the .inspect_llvm() and .inspect_asm() methods on the CPU dispatcher objects (the objects returned by decorating a function with a @jit-family decorator) to see the generated LLVM IR and assembly respectively.
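For instance, with f4 from above (a small sketch; called with no arguments, both methods return a dict keyed by the compiled signature):

f4(10)   # compile for an integer argument first

for sig, llvm_ir in f4.inspect_llvm().items():
    print(sig)
    print(llvm_ir[:500])    # start of the LLVM IR for that signature

for sig, asm in f4.inspect_asm().items():
    print(sig)
    print(asm[:500])        # start of the generated assembly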

louisabraham (Author)

I see, thank you very much!

After reflection, I think the problem (optimizing the permutation of a range product) is too complicated to be solved by LLVM, and I don't think #2949 will help because we are speaking of widely varying range values, not just a few special cases.

The only practical solution would be to automate the process of generating multiple loop orders and to have a heuristic for choosing which one should be used. The only problem is that this heuristic should ideally depend on the generated LLVM IR.

seibert (Contributor) commented May 23, 2018

This raises an interesting question of how to do effective metaprogramming in Python. Ideally, you would want a factory function that could generate code with different loop permutations, benchmark each, and then pick the best one for the rest of the application run. (I've used this technique to pick CUDA thread grid configurations with great success.)

Unfortunately, there isn't an easy way to do this kind of code generation in Python (probably not in most languages). There's no macro system, so much of the metaprogramming that Numba does internally is achieved by constructing strings of Python source code, execing them on the fly, and then feeding the resulting function object to the Numba compiler.
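As an illustration of that string-building approach applied to the triple loop above (a rough sketch with hypothetical helper names; the benchmarking is deliberately crude):

from itertools import permutations
from time import time
from numba import njit

TEMPLATE = """
def kernel(n):
    s = 0
    for {0} in range(n):
        for {1} in range(n):
            for {2} in range(n):
                s += (i * k < j * j) - (i * k > j * j)
    return s
"""

def make_variants():
    # build one jitted function per loop order by generating source text and exec-ing it
    variants = []
    for order in permutations("ijk"):
        namespace = {}
        exec(TEMPLATE.format(*order), namespace)
        variants.append(njit(namespace["kernel"]))
    return variants

def pick_fastest(variants, n):
    # time each variant once (after a warm-up call to exclude compilation) and keep the best
    best, best_time = None, float("inf")
    for f in variants:
        f(10)                      # trigger compilation outside the timed region
        start = time()
        f(n)
        elapsed = time() - start
        if elapsed < best_time:
            best, best_time = f, elapsed
    return best

pick_fastest(make_variants(), 2500) would then return the permutation to use for the rest of the application run.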

louisabraham (Author)

Thank you, I was looking for the word metaprogramming. I think that just permuting loop products is easy. The idea here would be not to benchmark them but to optimize the iteration order. Don't hesitate to correct me, as I don't know the internals of Numba very well (I'm starting to read the code), but I think it would be possible (although not trivial).

stuartarchibald (Contributor)

I think the generated LLVM IR will essentially be invariant under loop permutation if the inner loop body is trivial.

The general class of problem presented, optimising loop iteration order for non-trivial cases without benchmarking, is hard, but tools like https://polly.llvm.org/ can help. I've added #2996 as a reminder to consider adding this to the Numba LLVM builds.

If your problem is not amenable to text-based generation, either mutating .__code__.* and instantiating a types.CodeType and then a types.FunctionType can work, or AST manipulation via inspect and ast could also be successful.

stuartarchibald (Contributor)

Closing this question as it seems to be resolved. Numba now has a discourse forum, https://numba.discourse.group/, which is great for questions like this; please do consider posting there in future :) Thanks!
