Performance hit with local temporary variables #3980

Open
cdeil opened this issue Apr 16, 2019 · 6 comments
Assignee: sklam
Labels: discussion, enhancement, performance

Comments

cdeil (Contributor) commented Apr 16, 2019


It looks like Numba cannot optimise simple functions well if they contain local temporary variables on multiple lines instead of a long expression on a single line.

With the latest Numba 0.43.1 on macOS from Anaconda and the sin(x) ** 2 + cos(x) ** 2 example from http://numba.pydata.org/numba-doc/latest/user/performance-tips.html I see this:

In [1]: import numpy as np

In [2]: import numba

In [3]: import numexpr

In [4]: def f1(x):
   ...:     return np.sin(x) ** 2 + np.cos(x) ** 2
   ...:

In [5]: def f2(x):
   ...:     s = np.sin(x)
   ...:     c = np.cos(x)
   ...:     return s ** 2 + c ** 2
   ...:

In [6]: a = np.arange(1.e7)

In [7]: f1_numba = numba.jit(f1)

In [8]: f2_numba = numba.jit(f2)

In [9]: %timeit f1(a)
227 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit f2(a)
260 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: %timeit f1_numba(a)
230 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [12]: %timeit f2_numba(a)
457 ms ± 5.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit numexpr.evaluate("sin(a) ** 2 + cos(a) ** 2")
52.3 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

  • Why is f2 a bit slower than f1? I thought NumPy creates temporary arrays either way, so it shouldn't matter how I write my code?
  • Why is f2_numba so much slower than f1_numba? Shouldn't both compile to the same function and give identical performance?

I couldn't find anything on http://numba.pydata.org/numba-doc/latest/user/performance-tips.html or anywhere else in the Numba docs about whether using local temporary variables is OK when one wants good performance.

Could you please add some documentation giving advice on this point?
Or, if possible, improve Numba so that the question becomes moot by delivering the same good performance one gets from a single expression?

sklam (Member) commented Apr 16, 2019

It looks like the array-expression fusion is not working across assignment.

sklam self-assigned this on Apr 16, 2019
sklam (Member) commented Apr 17, 2019

After reviewing the implementation of the array-expr rewrite, I noticed it is designed to work on a single basic block at a time. It cannot fuse across basic blocks and cannot determine whether a temporary variable (i.e. s and c from the code sample) is used in another block. Thus, it rejects fusing array-exprs that are assigned to a temporary variable. This behavior is verified in the testsuite.
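For illustration only (a hypothetical case, not from this issue), this is the kind of situation the single-block analysis cannot rule out once it sees an assignment: the temporary escapes into another basic block and has to be materialised as a real array.

import numpy as np
import numba

# Hypothetical sketch: the branch splits the function into several basic
# blocks, and the temporary s is observed in one of them, so the
# array-expression rewrite could not safely eliminate it here.
@numba.njit
def f3(x, flag):
    s = np.sin(x)
    if flag:
        return s
    return s ** 2 + np.cos(x) ** 2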

On the other hand, the wanted optimization is already provided by the parallel-accelerator suite of optimizations. Enabling it (i.e. @njit(parallel=True)) makes the expected array-expression fusion occur.

@stuartarchibald, @ehsantn, is there a way to enable the loop fusion using sequential lowering? Maybe we can make that the default behavior.
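For reference, a minimal sketch of that route using the f2 from the report (a sketch only; timings and thread counts will vary by machine):

import numpy as np
import numba

# Sketch assuming the f2 from the report: with parallel=True the
# ParallelAccelerator passes fuse the sin/cos/power loops into one kernel.
@numba.njit(parallel=True)
def f2_parallel(x):
    s = np.sin(x)
    c = np.cos(x)
    return s ** 2 + c ** 2

a = np.arange(1.e7)
f2_parallel(a)  # first call triggers compilation
print(f2_parallel.parallel_diagnostics(level=1))  # reports the fused loops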

sklam added the discussion label on Apr 17, 2019
stuartarchibald (Contributor) commented

I somewhat suspected this was the case, thanks for confirming. This is not something for production use, but yes, I think sequential lowering will work, as the fusion passes will be run:

import numpy as np
import numba
from numba import parfor


def f1(x):
    return np.sin(x) ** 2 + np.cos(x) ** 2


def f2(x):
    s = np.sin(x)
    c = np.cos(x)
    return s ** 2 + c ** 2


a = np.arange(1.e4)

# Run the parfor fusion passes, but lower the resulting loops sequentially
# (no threading backend).
parfor.sequential_parfor_lowering = True

f2_numba = numba.njit(parallel=True)(f2)

f2_numba(a)

print(f2_numba.parallel_diagnostics(level=4))

this gives the diagnostic:

================================================================================
====== Parallel Accelerator Optimizing:  Function f2, issue3980.py (12)  =======
================================================================================


Parallel loop listing for  Function f2, issue3980.py (12) 
------------------------------|loop #ID
def f2(x):                    | 
    s = np.sin(x)-------------| #0
    c = np.cos(x)-------------| #1
    return s ** 2 + c ** 2----| #2
Performing sequential lowering of loops...
--------------------------------------------------------------------------------
  Trying to fuse loops #1 and #2:
    - fusion succeeded: parallel for-loop #2 is fused into for-loop #1.
  Trying to fuse loops #0 and #1:
    - fusion succeeded: parallel for-loop #1 is fused into for-loop #0.
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--0 (parallel)
+--1 (parallel)
+--2 (parallel)


--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--0 (parallel, fused with loop(s): 1, 2)


 
Parallel region 0 (loop #0) had 2 loop(s) fused.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------

Instruction hoisting:
No instruction hoisting found
--------------------------------------------------------------------------------

and then inspecting the IR post legalization gives:

label 0:
    x = arg(0, name=x)                       ['x']
    x_shape.0 = getattr(value=x, attr=shape) ['x', 'x_shape.0']
    x_size0.1 = static_getitem(value=x_shape.0, index=0, index_var=None) ['x_shape.0', 'x_size0.1']
    del x_shape.0                            []
    $np_g_var.20 = global(np: <module 'numpy' from '<pypath>/lib/python3.7/site-packages/numpy/__init__.py'>) ['$np_g_var.20']
    $empty_attr_attr.21 = getattr(value=$np_g_var.20, attr=empty) ['$empty_attr_attr.21', '$np_g_var.20']
    $np_typ_var.22 = getattr(value=$np_g_var.20, attr=float64) ['$np_g_var.20', '$np_typ_var.22']
    del $np_g_var.20                         []
    $0.15 = call $empty_attr_attr.21(x_size0.1, $np_typ_var.22, func=$empty_attr_attr.21, args=[Var(x_size0.1, issue3980.py (13)), Var($np_typ_var.22, issue3980.py (15))], kws=(), vararg=None) ['$0.15', '$empty_attr_attr.21', '$np_typ_var.22', 'x_size0.1']
    del $np_typ_var.22                       []
    del $empty_attr_attr.21                  []
    $range_g_var.93 = global(range: <class 'range'>) ['$range_g_var.93']
    $range_c_var.94 = call $range_g_var.93(x_size0.1, func=$range_g_var.93, args=[Var(x_size0.1, issue3980.py (13))], kws=(), vararg=None) ['$range_c_var.94', '$range_g_var.93', 'x_size0.1']
    del x_size0.1                            []
    del $range_g_var.93                      []
    $iter_var.95 = getiter(value=$range_c_var.94) ['$iter_var.95', '$range_c_var.94']
    del $range_c_var.94                      []
    jump 5                                   []
label 5:
    $iternext_var.97 = iternext(value=$iter_var.95) ['$iter_var.95', '$iternext_var.97']
    $pair_first_var.98 = pair_first(value=$iternext_var.97) ['$iternext_var.97', '$pair_first_var.98']
    $pair_second_var.99 = pair_second(value=$iternext_var.97) ['$iternext_var.97', '$pair_second_var.99']
    del $iternext_var.97                     []
    parfor_index.5 = $pair_first_var.98      ['$pair_first_var.98', 'parfor_index.5']
    del $pair_first_var.98                   []
    branch $pair_second_var.99, 6, 7         ['$pair_second_var.99']
label 6:
    del $pair_second_var.99                  []
    $arg_out_var.10 = getitem(value=x, index=parfor_index.5) ['$arg_out_var.10', 'parfor_index.5', 'x']
    $0.1 = global(np: <module 'numpy' from '<pypath>/lib/python3.7/site-packages/numpy/__init__.py'>) ['$0.1']
    $0.2.11 = getattr(value=$0.1, attr=sin)  ['$0.1', '$0.2.11']
    del $0.1                                 []
    $expr_out_var.9 = call $0.2.11($arg_out_var.10, func=$0.2.11, args=[Var($arg_out_var.10, issue3980.py (13))], kws=(), vararg=None) ['$0.2.11', '$arg_out_var.10', '$expr_out_var.9']
    del $arg_out_var.10                      []
    del $0.2.11                              []
    $arg_out_var.17 = getitem(value=x, index=parfor_index.5) ['$arg_out_var.17', 'parfor_index.5', 'x']
    $0.5 = global(np: <module 'numpy' from '<pypath>/lib/python3.7/site-packages/numpy/__init__.py'>) ['$0.5']
    $0.6.18 = getattr(value=$0.5, attr=cos)  ['$0.5', '$0.6.18']
    del $0.5                                 []
    $expr_out_var.16 = call $0.6.18($arg_out_var.17, func=$0.6.18, args=[Var($arg_out_var.17, issue3980.py (14))], kws=(), vararg=None) ['$0.6.18', '$arg_out_var.17', '$expr_out_var.16']
    del $arg_out_var.17                      []
    del $0.6.18                              []
    $arg_out_var.26 = const(int, 2)          ['$arg_out_var.26']
    $arg_out_var.24 = $expr_out_var.9 ** $arg_out_var.26 ['$arg_out_var.24', '$arg_out_var.26', '$expr_out_var.9']
    del $expr_out_var.9                      []
    del $arg_out_var.26                      []
    $arg_out_var.29 = const(int, 2)          ['$arg_out_var.29']
    $arg_out_var.27 = $expr_out_var.16 ** $arg_out_var.29 ['$arg_out_var.27', '$arg_out_var.29', '$expr_out_var.16']
    del $expr_out_var.16                     []
    del $arg_out_var.29                      []
    $expr_out_var.23 = $arg_out_var.24 + $arg_out_var.27 ['$arg_out_var.24', '$arg_out_var.27', '$expr_out_var.23']
    del $arg_out_var.27                      []
    del $arg_out_var.24                      []
    $0.15[parfor_index.5] = $expr_out_var.23 ['$0.15', '$expr_out_var.23', 'parfor_index.5']
    del parfor_index.5                       []
    del $expr_out_var.23                     []
    jump 5                                   []
label 7:
    del x                                    []
    del parfor_index.5                       []
    del $pair_second_var.99                  []
    del $iter_var.95                         []
    $0.16 = cast(value=$0.15)                ['$0.15', '$0.16']
    del $0.15                                []
    return $0.16                             ['$0.16']

block label 6 is the fused kernel that is driven by iteration in label 5.
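(For anyone trying to reproduce this: the comment above does not say how the IR was captured; one way, as an assumption on my part, is to enable Numba's documented debug switches before importing it.)

import os

# Assumption: these documented Numba debug environment variables are one way
# to obtain IR dumps and array-optimisation traces; they must be set before
# numba is imported.
os.environ["NUMBA_DUMP_IR"] = "1"          # dump the Numba IR of compiled functions
os.environ["NUMBA_DEBUG_ARRAY_OPT"] = "1"  # trace the parfor / array-expr passes

import numba  # noqa: E402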

sklam (Member) commented Apr 17, 2019

It looks like we should consider moving parfor.sequential_parfor_lowering=True into a jit option.

cdeil (Contributor, Author) commented Apr 18, 2019

I think it would be a big win if you could apply this optimisation by default with numba.jit:

  • more predictable performance: Python and NumPy users expect the same performance whether they write one long expression or split it over a few lines with local variables for better readability.
  • make the statement "Numba can often speed up your NumPy code" true in many more cases: currently, for many of my functions that contain a few lines of NumPy (similar to the one above), Numba either matches NumPy's performance (like f1_numba above) or makes it slower (like f2_numba above).

Why not do this optimization by default?

  • Is it hard to implement in the default numba.jit?
  • Are there other concerns, e.g. slower JIT compile times for long functions?

Thank you for all your work on Numba, it is wonderful!

ehsantn (Collaborator) commented Apr 22, 2019

Yes, I think this should be a jit option. ParallelAccelerator's optimizations, including fusion, are quite useful in general even without the threading backend.
