Performance hit with local temporary variables #3980

Open
cdeil opened this issue Apr 16, 2019 · 6 comments
Assignee: sklam
Labels: discussion, enhancement, performance

Comments

cdeil (Contributor) commented Apr 16, 2019


It looks like Numba cannot optimise simple functions well if they contain local temporary variables on multiple lines instead of a long expression on a single line.

With the latest Numba 0.43.1 on macOS from Anaconda and the sin(x) ** 2 + cos(x) ** 2 example from http://numba.pydata.org/numba-doc/latest/user/performance-tips.html I see this:

In [1]: import numpy as np

In [2]: import numba

In [3]: import numexpr

In [4]: def f1(x):
   ...:     return np.sin(x) ** 2 + np.cos(x) ** 2
   ...:

In [5]: def f2(x):
   ...:     s = np.sin(x)
   ...:     c = np.cos(x)
   ...:     return s ** 2 + c ** 2
   ...:

In [6]: a = np.arange(1.e7)

In [7]: f1_numba = numba.jit(f1)

In [8]: f2_numba = numba.jit(f2)

In [9]: %timeit f1(a)
227 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit f2(a)
260 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit f1_numba(a)
230 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit f2_numba(a)
457 ms ± 5.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit numexpr.evaluate("sin(a) ** 2 + cos(a) ** 2")
52.3 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

  • Why is f2 a bit slower than f1? I thought NumPy creates temporary arrays either way, so it shouldn't matter how I write my code?
  • Why is f2_numba so much slower than f1_numba? Shouldn't both compile to the same function and give identical performance?

I couldn't find anything on http://numba.pydata.org/numba-doc/latest/user/performance-tips.html or anywhere else in the Numba docs about whether using local temporary variables is OK when one wants good performance.

Could you please add some documentation giving advice on this point?
Or, if possible, improve Numba so that the question becomes moot by delivering the same good performance one gets from a single expression?

sklam (Member) commented Apr 16, 2019

It looks like the array-expression fusion is not working across assignment.

sklam self-assigned this on Apr 16, 2019
sklam (Member) commented Apr 17, 2019

After reviewing the implementation of the array-expr rewrite, I noticed it is designed to work on a single basic block at a time. It cannot fuse across basic blocks and cannot determine whether a temporary variable (i.e. s and c from the code sample) is used in another block. Thus, it rejects fusing array-exprs that are assigned to a temporary variable. This behavior is verified in the testsuite.
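For illustration only (a hypothetical case, not from this issue), this is the kind of situation the single-block analysis cannot rule out once it sees an assignment: the temporary escapes into another basic block and has to be materialised as a real array.

import numpy as np
import numba

# Hypothetical sketch: the branch splits the function into several basic
# blocks, and the temporary s is observed in one of them, so the
# array-expression rewrite could not safely eliminate it here.
@numba.njit
def f3(x, flag):
    s = np.sin(x)
    if flag:
        return s
    return s ** 2 + np.cos(x) ** 2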

On the other hand, the wanted optimization is already provided by the parallel-accelerator suite of optimizations. Enabling it (i.e. @njit(parallel=True)) makes the expected array-expression fusion occur.

@stuartarchibald, @ehsantn, is there a way to enable the loop fusion using sequential lowering? Maybe we can make that the default behavior.
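For reference, a minimal sketch of that route using the f2 from the report (a sketch only; timings and thread counts will vary by machine):

import numpy as np
import numba

# Sketch assuming the f2 from the report: with parallel=True the
# ParallelAccelerator passes fuse the sin/cos/power loops into one kernel.
@numba.njit(parallel=True)
def f2_parallel(x):
    s = np.sin(x)
    c = np.cos(x)
    return s ** 2 + c ** 2

a = np.arange(1.e7)
f2_parallel(a)  # first call triggers compilation
print(f2_parallel.parallel_diagnostics(level=1))  # reports the fused loops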

sklam added the discussion label on Apr 17, 2019
stuartarchibald (Contributor) commented

I somewhat suspected this was the case, thanks for confirming. This is not something for production use, but yes, I think sequential lowering will work, as the fusion passes will be run:

import numpy as np
import numba
from numba import parfor


def f1(x):
    return np.sin(x) ** 2 + np.cos(x) ** 2


def f2(x):
    s = np.sin(x)
    c = np.cos(x)
    return s ** 2 + c ** 2


a = np.arange(1.e4)

# Run the parfor fusion passes, but lower the resulting loops sequentially
# (no threading backend).
parfor.sequential_parfor_lowering = True

f2_numba = numba.njit(parallel=True)(f2)

f2_numba(a)

print(f2_numba.parallel_diagnostics(level=4))

this gives the diagnostic:

================================================================================
====== Parallel Accelerator Optimizing:  Function f2, issue3980.py (12)  =======
================================================================================


Parallel loop listing for  Function f2, issue3980.py (12) 
------------------------------|loop #ID
def f2(x):                    | 
    s = np.sin(x)-------------| #0
    c = np.cos(x)-------------| #1
    return s ** 2 + c ** 2----| #2
Performing sequential lowering of loops...
--------------------------------------------------------------------------------
  Trying to fuse loops #1 and #2:
    - fusion succeeded: parallel for-loop #2 is fused into for-loop #1.
  Trying to fuse loops #0 and #1:
    - fusion succeeded: parallel for-loop #1 is fused into for-loop #0.
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--0 (parallel)
+--1 (parallel)
+--2 (parallel)


--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--0 (parallel, fused with loop(s): 1, 2)


 
Parallel region 0 (loop #0) had 2 loop(s) fused.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------

Instruction hoisting:
No instruction hoisting found
--------------------------------------------------------------------------------

and then inspecting the IR post legalization gives:

label 0:
    x = arg(0, name=x)                       ['x']
    x_shape.0 = getattr(value=x, attr=shape) ['x', 'x_shape.0']
    x_size0.1 = static_getitem(value=x_shape.0, index=0, index_var=None) ['x_shape.0', 'x_size0.1']
    del x_shape.0                            []
    $np_g_var.20 = global(np: <module 'numpy' from '<pypath>/lib/python3.7/site-packages/numpy/__init__.py'>) ['$np_g_var.20']
    $empty_attr_attr.21 = getattr(value=$np_g_var.20, attr=empty) ['$empty_attr_attr.21', '$np_g_var.20']
    $np_typ_var.22 = getattr(value=$np_g_var.20, attr=float64) ['$np_g_var.20', '$np_typ_var.22']
    del $np_g_var.20                         []
    $0.15 = call $empty_attr_attr.21(x_size0.1, $np_typ_var.22, func=$empty_attr_attr.21, args=[Var(x_size0.1, issue3980.py (13)), Var($np_typ_var.22, issue3980.py (15))], kws=(), vararg=None) ['$0.15', '$empty_attr_attr.21', '$np_typ_var.22', 'x_size0.1']
    del $np_typ_var.22                       []
    del $empty_attr_attr.21                  []
    $range_g_var.93 = global(range: <class 'range'>) ['$range_g_var.93']
    $range_c_var.94 = call $range_g_var.93(x_size0.1, func=$range_g_var.93, args=[Var(x_size0.1, issue3980.py (13))], kws=(), vararg=None) ['$range_c_var.94', '$range_g_var.93', 'x_size0.1']
    del x_size0.1                            []
    del $range_g_var.93                      []
    $iter_var.95 = getiter(value=$range_c_var.94) ['$iter_var.95', '$range_c_var.94']
    del $range_c_var.94                      []
    jump 5                                   []
label 5:
    $iternext_var.97 = iternext(value=$iter_var.95) ['$iter_var.95', '$iternext_var.97']
    $pair_first_var.98 = pair_first(value=$iternext_var.97) ['$iternext_var.97', '$pair_first_var.98']
    $pair_second_var.99 = pair_second(value=$iternext_var.97) ['$iternext_var.97', '$pair_second_var.99']
    del $iternext_var.97                     []
    parfor_index.5 = $pair_first_var.98      ['$pair_first_var.98', 'parfor_index.5']
    del $pair_first_var.98                   []
    branch $pair_second_var.99, 6, 7         ['$pair_second_var.99']
label 6:
    del $pair_second_var.99                  []
    $arg_out_var.10 = getitem(value=x, index=parfor_index.5) ['$arg_out_var.10', 'parfor_index.5', 'x']
    $0.1 = global(np: <module 'numpy' from '<pypath>/lib/python3.7/site-packages/numpy/__init__.py'>) ['$0.1']
    $0.2.11 = getattr(value=$0.1, attr=sin)  ['$0.1', '$0.2.11']
    del $0.1                                 []
    $expr_out_var.9 = call $0.2.11($arg_out_var.10, func=$0.2.11, args=[Var($arg_out_var.10, issue3980.py (13))], kws=(), vararg=None) ['$0.2.11', '$arg_out_var.10', '$expr_out_var.9']
    del $arg_out_var.10                      []
    del $0.2.11                              []
    $arg_out_var.17 = getitem(value=x, index=parfor_index.5) ['$arg_out_var.17', 'parfor_index.5', 'x']
    $0.5 = global(np: <module 'numpy' from '<pypath>/lib/python3.7/site-packages/numpy/__init__.py'>) ['$0.5']
    $0.6.18 = getattr(value=$0.5, attr=cos)  ['$0.5', '$0.6.18']
    del $0.5                                 []
    $expr_out_var.16 = call $0.6.18($arg_out_var.17, func=$0.6.18, args=[Var($arg_out_var.17, issue3980.py (14))], kws=(), vararg=None) ['$0.6.18', '$arg_out_var.17', '$expr_out_var.16']
    del $arg_out_var.17                      []
    del $0.6.18                              []
    $arg_out_var.26 = const(int, 2)          ['$arg_out_var.26']
    $arg_out_var.24 = $expr_out_var.9 ** $arg_out_var.26 ['$arg_out_var.24', '$arg_out_var.26', '$expr_out_var.9']
    del $expr_out_var.9                      []
    del $arg_out_var.26                      []
    $arg_out_var.29 = const(int, 2)          ['$arg_out_var.29']
    $arg_out_var.27 = $expr_out_var.16 ** $arg_out_var.29 ['$arg_out_var.27', '$arg_out_var.29', '$expr_out_var.16']
    del $expr_out_var.16                     []
    del $arg_out_var.29                      []
    $expr_out_var.23 = $arg_out_var.24 + $arg_out_var.27 ['$arg_out_var.24', '$arg_out_var.27', '$expr_out_var.23']
    del $arg_out_var.27                      []
    del $arg_out_var.24                      []
    $0.15[parfor_index.5] = $expr_out_var.23 ['$0.15', '$expr_out_var.23', 'parfor_index.5']
    del parfor_index.5                       []
    del $expr_out_var.23                     []
    jump 5                                   []
label 7:
    del x                                    []
    del parfor_index.5                       []
    del $pair_second_var.99                  []
    del $iter_var.95                         []
    $0.16 = cast(value=$0.15)                ['$0.15', '$0.16']
    del $0.15                                []
    return $0.16                             ['$0.16']

block label 6 is the fused kernel that is driven by iteration in label 5.
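(For anyone trying to reproduce this: the comment above does not say how the IR was captured; one way, as an assumption on my part, is to enable Numba's documented debug switches before importing it.)

import os

# Assumption: these documented Numba debug environment variables are one way
# to obtain IR dumps and array-optimisation traces; they must be set before
# numba is imported.
os.environ["NUMBA_DUMP_IR"] = "1"          # dump the Numba IR of compiled functions
os.environ["NUMBA_DEBUG_ARRAY_OPT"] = "1"  # trace the parfor / array-expr passes

import numba  # noqa: E402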

sklam (Member) commented Apr 17, 2019

It looks like we should consider moving parfor.sequential_parfor_lowering=True into a jit option.

cdeil (Contributor, Author) commented Apr 18, 2019

I think it would be a big win if you could apply this optimisation by default with numba.jit:

  • more predictable performance: Python and NumPy users expect the same performance whether they write one long expression or split it over a few lines with local variables for better readability.
  • make the statement "Numba can often speed up your NumPy code" true in many more cases: currently, for many of my functions that contain a few lines of NumPy (similar to the one above), Numba either matches NumPy's performance (like f1_numba above) or makes it slower (like f2_numba above).

Why not do this optimization by default?

  • Is it hard to implement in the default numba.jit?
  • Are there other concerns, e.g. slower JIT compile times for long functions?

Thank you for all your work on Numba, it is wonderful!

ehsantn (Collaborator) commented Apr 22, 2019

Yes, I think this should be a jit option. ParallelAccelerator's optimizations, including fusion, are quite useful in general even without the threading backend.
