# Optimizations

## Type Stability

In [1]:
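try
    using BenchmarkTools
catch err
    # `catch err` catches any exception; rethrow everything except a missing package
    err isa ArgumentError || rethrow()
    using Pkg # imports Julia's built-in (excellent) package manager
    Pkg.add("BenchmarkTools") # installs a package to Julia
    using BenchmarkTools
end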

### Basics

#### Global Variables

In [2]:
a = 2
function plusmulta_bad(n)
    res = 0
    for i=1:n
        res += i*a
    end
    res
end     

plusmulta_bad (generic function with 1 method)

In [3]:
@btime plusmulta_bad(1_000_000)

  108.224 ms (2999212 allocations: 45.76 MiB)


1000001000000

In [4]:
@code_warntype plusmulta_bad(1_000_000)

Variables
  #self#::Core.Compiler.Const(plusmulta_bad, false)
  n::Int64
  res::Any
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Any
1 ─       (res = 0)
│   %2  = (1:n)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_4 = Base.iterate(%2))
│   %4  = (@_4 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_4::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = res::Any
│   %11 = (i * Main.a)::Any
│         (res = %10 + %11)
│         (@_4 = Base.iterate(%2, %9))
│   %14 = (@_4 === nothing)::Bool

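This function is really slow because the type of the global variable *a* is not fixed: a non-constant global can be rebound to a value of any type at any time, so the compiler must treat *a* (and therefore *res*) as Any and box intermediate values on the heap in every iteration.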

In [5]:
const a2 = 2
function plusmulta_good(n)
    res = 0
    for i=1:n
        res += i*a2
    end
    res
end     

plusmulta_good (generic function with 1 method)

Solution 1: make the global variable a constant.
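
The effect of *const* on type inference can also be checked without reading the full *@code_warntype* output. The following is a minimal sketch; the names *scale_var*, *SCALE_CONST*, *sum_with_var* and *sum_with_const* are hypothetical and not part of the notebook. It uses *Base.return_types*, an unexported helper that returns the inferred return types for a given argument signature.

```julia
# Hypothetical names, chosen only for this illustration.
scale_var = 2          # non-constant global: may be rebound to any type later
const SCALE_CONST = 2  # constant global: the compiler can rely on its type

function sum_with_var(n)
    res = 0
    for i in 1:n
        res += i * scale_var
    end
    res
end

function sum_with_const(n)
    res = 0
    for i in 1:n
        res += i * SCALE_CONST
    end
    res
end

# Inferred return type when called with an Int argument
rt_var = Base.return_types(sum_with_var, (Int,))[1]
rt_const = Base.return_types(sum_with_const, (Int,))[1]
println("non-constant global: ", rt_var)
println("constant global:     ", rt_const)
```

A return type of Any signals the same instability that *@code_warntype* highlights for *plusmulta_bad*.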

In [6]:
@btime plusmulta_good(1_000_000)

  1.736 ns (0 allocations: 0 bytes)


1000001000000

In [7]:
@code_warntype plusmulta_good(1_000_000)

Variables
  #self#::Core.Compiler.Const(plusmulta_good, false)
  n::Int64
  res::Int64
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Int64
1 ─       (res = 0)
│   %2  = (1:n)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_4 = Base.iterate(%2))
│   %4  = (@_4 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_4::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = res::Int64
│   %11 = (i * Main.a2)::Int64
│         (res = %10 + %11)
│         (@_4 = Base.iterate(%2, %9))
│   %14 = (@_4 === nothing)::Bool
│   %15 = 

In [8]:
@code_llvm plusmulta_good(1_000_000)


;  @ In[5]:3 within `plusmulta_good'
define i64 @julia_plusmulta_good_16640(i64) {
top:
;  @ In[5]:4 within `plusmulta_good'
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:275 within `Type'
; ││┌ @ range.jl:280 within `unitrange_last'
; │││┌ @ operators.jl:341 within `>='
; ││││┌ @ int.jl:424 within `<='
       %1 = icmp sgt i64 %0, 0
; └└└└└
  br i1 %1, label %L7.L12_crit_edge, label %L29

L7.L12_crit_edge:                                 ; preds = %top
  %2 = shl i64 %0, 2
  %3 = add nsw i64 %0, -1
  %4 = add nsw i64 %0, -2
  %5 = mul i64 %3, %4
  %6 = and i64 %5, -2
  %7 = add i64 %2, %6
  %8 = add i64 %7, -2
;  @ In[5]:7 within `plusmulta_good'
  br label %L29

L29:                                              ; preds = %L7.L12_crit_edge, %top
  %value_phi9 = phi i64 [ 0, %top ], [ %8, %L7.L12_crit_edge ]
  ret i64 %value_phi9
}


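In fact, the compiler optimized the for-loop away entirely: the LLVM code contains no loop, only a handful of arithmetic instructions that evaluate a closed-form expression for the sum. This explains the run time of under 2 ns.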
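
To see what the generated code computes, the arithmetic in the *L7.L12_crit_edge* block can be transcribed by hand. The sketch below is such a transcription (the function names are hypothetical): for n > 0 the instructions evaluate 4n + (n - 1)(n - 2) - 2, which simplifies to n(n + 1), the sum of 2i for i from 1 to n. The *and* with -2 only clears the lowest bit, which is already zero because the product of two consecutive integers is even.

```julia
# Hand transcription of the LLVM arithmetic above (hypothetical helper name).
#   4n + (n - 1)(n - 2) - 2  ==  n^2 + n  ==  n * (n + 1)
llvm_closed_form(n) = n > 0 ? 4n + (n - 1) * (n - 2) - 2 : 0

# Reference implementation: the explicit loop with the factor 2 written out
function loop_sum(n)
    res = 0
    for i in 1:n
        res += 2i
    end
    res
end

for n in (0, 1, 2, 10, 1_000)
    @assert llvm_closed_form(n) == loop_sum(n) == (n > 0 ? n * (n + 1) : 0)
end
println(llvm_closed_form(1_000_000))  # matches the benchmark result 1000001000000
```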

In [9]:
function plusmulta_good(n, a)
    res = 0
    for i=1:n
        res += i*a
    end
    res
end     

plusmulta_good (generic function with 2 methods)

In [10]:
@btime plusmulta_good(1_000_000, 2)

  1.736 ns (0 allocations: 0 bytes)


1000001000000

In [11]:
@code_llvm plusmulta_good(1_000_000, 2)


;  @ In[9]:2 within `plusmulta_good'
define i64 @julia_plusmulta_good_16655(i64, i64) {
top:
;  @ In[9]:3 within `plusmulta_good'
; ┌ @ range.jl:5 within `Colon'
; │┌ @ range.jl:275 within `Type'
; ││┌ @ range.jl:280 within `unitrange_last'
; │││┌ @ operators.jl:341 within `>='
; ││││┌ @ int.jl:424 within `<='
       %2 = icmp sgt i64 %0, 0
; └└└└└
  br i1 %2, label %L7.L12_crit_edge, label %L28

L7.L12_crit_edge:                                 ; preds = %top
  %3 = shl nuw i64 %0, 1
  %4 = add nsw i64 %0, -1
  %5 = zext i64 %4 to i65
  %6 = add nsw i64 %0, -2
  %7 = zext i64 %6 to i65
  %8 = mul i65 %5, %7
  %9 = lshr i65 %8, 1
  %10 = trunc i65 %9 to i64
  %11 = add i64 %3, %10
  %12 = add i64 %11, -1
  %13 = mul i64 %12, %1
;  @ In[9]:6 within `plusmulta_good'
  br label %L28

L28:                                              ; preds = %L7.L12_crit_edge, %top
  %value_phi9 = phi i64 [ 0, %top ], [ %13, %L7.L12_crit_edge ]
  ret i64 %value_phi9
}


We got the same good performance when using *a* as a method parameter.

Note that both methods (without and with *a* as a parameter) belong to the same function in this example; the concrete method is chosen according to the call signature using multiple dispatch.
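
The dispatch on the number of arguments can be illustrated in isolation. This is a small sketch with a hypothetical function *scaled_sum*; the one-argument method simply forwards to the two-argument method with a fixed factor, so no global variable is involved.

```julia
# Two methods of one (hypothetical) function, distinguished by argument count.
function scaled_sum(n, a)
    res = 0
    for i in 1:n
        res += i * a
    end
    res
end

# The one-argument method forwards a fixed factor to the two-argument method.
scaled_sum(n) = scaled_sum(n, 2)

println(length(methods(scaled_sum)))  # number of methods defined for scaled_sum
println(scaled_sum(10))               # dispatches to the one-argument method
println(scaled_sum(10, 3))            # dispatches to the two-argument method
```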

In [12]:
a = 2
randsum_bad(n) = begin # a rather unusual (and not recommended) way to define a function...
    res = 0
    for i = 1:n
        res += a*rand()
    end
    res
end

randsum_bad (generic function with 1 method)

In [13]:
@btime randsum_bad(1_000)

  92.300 μs (3000 allocations: 46.88 KiB)


1003.2320658465163

In [14]:
function randsum_better(n, a)
    res = 0
    for i = 1:n
        res += a*rand()
    end
    res
end

randsum_better (generic function with 1 method)

In [15]:
@btime randsum_better(1_000, $a)

  4.974 μs (0 allocations: 0 bytes)


988.3037837966322

This is a fairer comparison because the compiler cannot optimize the loop away. Still, passing *a* as a parameter instead of reading a non-constant global improves performance by a factor of roughly 19 (92.3 μs vs. 4.974 μs).

#### Type Stability Inside Methods

Can we get better?

In [16]:
@code_warntype randsum_better(1_000, a)

Variables
  #self#::Core.Compiler.Const(randsum_better, false)
  n::Int64
  a::Int64
  res::Union{Float64, Int64}
  @_5::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Union{Float64, Int64}
1 ─       (res = 0)
│   %2  = (1:n)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_5 = Base.iterate(%2))
│   %4  = (@_5 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_5::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = res::Union{Float64, Int64}
│   %11 = Main.rand()::Float64
│   %12 = (a * %11)::Float64
│

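The variable *res* is still not type-stable. It is initialized as an integer (0), but the added random numbers are floating-point values, so *res* changes from Int64 to Float64 in the first iteration. This is why its inferred type is Union{Float64, Int64}.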

Let's fix this:

In [17]:
function randsum_good(n, a)
    res = 0. # note the . which makes this a Float64 number
    for i = 1:n
        res += a*rand()
    end
    res
end

randsum_good (generic function with 1 method)

In [18]:
@code_warntype randsum_good(1_000, a)

Variables
  #self#::Core.Compiler.Const(randsum_good, false)
  n::Int64
  a::Int64
  res::Float64
  @_5::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64

Body::Float64
1 ─       (res = 0.0)
│   %2  = (1:n)::Core.Compiler.PartialStruct(UnitRange{Int64}, Any[Core.Compiler.Const(1, false), Int64])
│         (@_5 = Base.iterate(%2))
│   %4  = (@_5 === nothing)::Bool
│   %5  = Base.not_int(%4)::Bool
└──       goto #4 if not %5
2 ┄ %7  = @_5::Tuple{Int64,Int64}::Tuple{Int64,Int64}
│         (i = Core.getfield(%7, 1))
│   %9  = Core.getfield(%7, 2)::Int64
│   %10 = res::Float64
│   %11 = Main.rand()::Float64
│   %12 = (a * %11)::Float64
│         (res = %10 + %12)
│         (@_5 = Base.iterate(%2, %9)

In [19]:
@btime randsum_good(1_000, $a)

  3.621 μs (0 allocations: 0 bytes)


1039.1442226032727

Making *res* type-stable reduces the run time by a further 27% or so (4.974 μs vs. 3.621 μs).
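
Hard-coding the accumulator as 0. works when the summands are Float64, but it ties the function to that type. A possible generic variant, shown here as a sketch with the hypothetical name *randsum_generic*, derives the accumulator type from the type of *a* with *promote_type* and *zero*, so that *res* already has the type of the summands before the loop starts.

```julia
# Sketch: initialise the accumulator with the type that `a * rand()` will have.
function randsum_generic(n, a)
    # rand() returns Float64, so the summands have the promoted type of a and Float64
    res = zero(promote_type(typeof(a), Float64))
    for i in 1:n
        res += a * rand()
    end
    res
end

# Inferred return types for an integer and a Float32 factor
rt_int = Base.return_types(randsum_generic, (Int, Int))[1]
rt_f32 = Base.return_types(randsum_generic, (Int, Float32))[1]
println(rt_int)
println(rt_f32)
```

Both signatures should infer a single concrete type rather than a Union, which is the property that *@code_warntype* checks above.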

### Custom Data Structures

In [20]:
abstract type MyDataTypes end

In [21]:
function fill_data!(data_array:: AbstractArray{T, 1}) where {T <: MyDataTypes}
    for i = 1:length(data_array)
        data = T(i, rand())
        data_array[i] = data
    end
end 

fill_data! (generic function with 1 method)

In [22]:
function aggregate_data(data_array:: AbstractArray{T, 1}) where {T <: MyDataTypes}
    res = zero(data_array[1].id * data_array[1].value)
    for i in eachindex(data_array)
        @inbounds row = data_array[i]
        res += row.id * row.value
    end
    res
end     

aggregate_data (generic function with 1 method)

#### Bad - Using Abstract Types in Structures

In [23]:
struct MyBadData <: MyDataTypes
    id:: Integer
    value:: AbstractFloat
end

In [24]:
data_array_bad = Array{MyBadData, 1}(undef, 1_000)
@btime fill_data!(data_array_bad)

  34.172 μs (2489 allocations: 54.52 KiB)


In [25]:
@btime aggregate_data(data_array_bad)

  101.220 μs (2001 allocations: 31.27 KiB)


250448.92119289367

In [26]:
@code_warntype aggregate_data(data_array_bad)

Variables
  #self#::Core.Compiler.Const(aggregate_data, false)
  data_array::Array{MyBadData,1}
  res::Any
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64
  val::MyBadData
  row::MyBadData

Body::Any
1 ─ %1  = Base.getindex(data_array, 1)::MyBadData
│   %2  = Base.getproperty(%1, :id)::Integer
│   %3  = Base.getindex(data_array, 1)::MyBadData
│   %4  = Base.getproperty(%3, :value)::AbstractFloat
│   %5  = (%2 * %4)::Any
│         (res = Main.zero(%5))
│   %7  = Main.eachindex(data_array)::Base.OneTo{Int64}
│         (@_4 = Base.iterate(%7))
│   %9  = (@_4 === nothing)::Bool
│   %10 = Base.not_int(%9)::Bool
└──       goto #4 if not %10
2 ┄ %12 = @_4::Tup

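Using abstract data types inside user-defined structures introduces a type instability which significantly reduces performance: the compiler only knows that *id* is some Integer and *value* some AbstractFloat, so every field access yields a value of unknown concrete type.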
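
Whether a struct has abstractly typed fields can be checked programmatically. The sketch below uses two hypothetical structs that mirror the pattern in this section, together with the built-in functions *fieldtypes*, *isconcretetype* and *isbitstype*; the helper *all_fields_concrete* is an assumption made for this illustration.

```julia
# Hypothetical structs mirroring the abstract and concrete field declarations.
struct AbstractFields
    id::Integer
    value::AbstractFloat
end

struct ConcreteFields
    id::Int
    value::Float64
end

# Helper (assumption for this sketch): true if every field has a concrete type
all_fields_concrete(T) = all(isconcretetype, fieldtypes(T))

println(all_fields_concrete(AbstractFields))
println(all_fields_concrete(ConcreteFields))

# Only the struct with concrete primitive fields is a plain-bits type,
# which lets an Array store the values inline instead of as references.
println(isbitstype(AbstractFields))
println(isbitstype(ConcreteFields))
```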

#### Using Concrete Types in Structures

In [27]:
struct MyGoodInflexibleData <: MyDataTypes
    id:: Int
    value:: Float64
end

In [28]:
data_array_good1 = Array{MyGoodInflexibleData, 1}(undef, 1_000)
@btime fill_data!(data_array_good1)

  5.250 μs (0 allocations: 0 bytes)


In [29]:
@btime aggregate_data(data_array_good1)

  1.472 μs (1 allocation: 16 bytes)


250311.52680232056

In [30]:
@code_warntype aggregate_data(data_array_good1)

Variables
  #self#::Core.Compiler.Const(aggregate_data, false)
  data_array::Array{MyGoodInflexibleData,1}
  res::Float64
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64
  val::MyGoodInflexibleData
  row::MyGoodInflexibleData

Body::Float64
1 ─ %1  = Base.getindex(data_array, 1)::MyGoodInflexibleData
│   %2  = Base.getproperty(%1, :id)::Int64
│   %3  = Base.getindex(data_array, 1)::MyGoodInflexibleData
│   %4  = Base.getproperty(%3, :value)::Float64
│   %5  = (%2 * %4)::Float64
│         (res = Main.zero(%5))
│   %7  = Main.eachindex(data_array)::Base.OneTo{Int64}
│         (@_4 = Base.iterate(%7))
│   %9  = (@_4 === nothing)::Bool
│   %10 = Base.not_int(%9)::Bool
└──       goto #4 if not %10
2 ┄

Defining concrete data types inside a structure gives type-stability (and thus performance), but reduces flexibility - e.g. we cannot use Float32 as *value* anymore.

#### Parametric Types

In [31]:
struct MyGoodData{T <: Integer, U <: Number} <: MyDataTypes
    id:: T
    value:: U
end

In [32]:
data_array_good2 = Array{MyGoodData{Int, Float64}, 1}(undef, 1_000)
@btime fill_data!(data_array_good2)

  5.251 μs (0 allocations: 0 bytes)


In [33]:
@btime aggregate_data(data_array_good2)

  1.470 μs (1 allocation: 16 bytes)


254783.94285752467

In [34]:
@code_warntype aggregate_data(data_array_good2)

Variables
  #self#::Core.Compiler.Const(aggregate_data, false)
  data_array::Array{MyGoodData{Int64,Float64},1}
  res::Float64
  @_4::Union{Nothing, Tuple{Int64,Int64}}
  i::Int64
  val::MyGoodData{Int64,Float64}
  row::MyGoodData{Int64,Float64}

Body::Float64
1 ─ %1  = Base.getindex(data_array, 1)::MyGoodData{Int64,Float64}
│   %2  = Base.getproperty(%1, :id)::Int64
│   %3  = Base.getindex(data_array, 1)::MyGoodData{Int64,Float64}
│   %4  = Base.getproperty(%3, :value)::Float64
│   %5  = (%2 * %4)::Float64
│         (res = Main.zero(%5))
│   %7  = Main.eachindex(data_array)::Base.OneTo{Int64}
│         (@_4 = Base.iterate(%7))
│   %9  = (@_4 === nothing)::Bool
│   %10 = Base.not_int(%9)::Bool
└──       goto #4

Parametric data types give both type-stability (and thus performance) and flexibility and are therefore usually the best solution.
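
The flexibility gained by the type parameters can be shown with Float32 values, which the struct with hard-coded field types cannot hold. The following sketch is self-contained and uses a hypothetical struct *Sample* and function *weighted_total* that follow the same pattern as *MyGoodData* and *aggregate_data*.

```julia
# Hypothetical parametric struct following the pattern of this section.
struct Sample{T <: Integer, U <: Number}
    id::T
    value::U
end

# Aggregation in the style of aggregate_data; the accumulator type follows the data.
function weighted_total(samples::AbstractArray{S, 1}) where {S <: Sample}
    res = zero(samples[1].id * samples[1].value)
    for i in eachindex(samples)
        res += samples[i].id * samples[i].value
    end
    res
end

# The type parameters are inferred from the constructor arguments.
samples32 = [Sample(i, Float32(i)) for i in 1:3]
samples64 = [Sample(i, Float64(i)) for i in 1:3]

println(eltype(samples32))                   # concrete element type with Float32 values
println(typeof(weighted_total(samples32)))   # result type follows the Float32 values
println(typeof(weighted_total(samples64)))   # result type follows the Float64 values
```

Each concrete combination of parameters gets its own specialised, type-stable code, so the same struct definition serves both precisions.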

## Allocations

## Further Optimizations

The following macros can give significant speed-ups in certain situations.
However, these optimizations are disabled by default for good reason: each one removes a safety check or relaxes a guarantee, so use them with caution.

### Baseline

In [42]:
my_array = rand(1_000_000)

1000000-element Array{Float64,1}:
 0.48251039775571636 
 0.6060967900875225  
 0.6433419971075629  
 0.5672518352733051  
 0.6330449556462303  
 0.7847472161059443  
 0.6346399388443404  
 0.9106312549030293  
 0.7672216770021818  
 0.016045060370933006
 0.002823502264714506
 0.04043181535589446 
 0.8424429448402266  
 ⋮                   
 0.20209203603669912 
 0.7235009637225687  
 0.7721670611958109  
 0.5345527557796439  
 0.29539609857093296 
 0.35311189336195103 
 0.30548114650768987 
 0.841808211370416   
 0.6500169780553795  
 0.41714463376159006 
 0.9968664190952854  
 0.3217207185034947  

In [43]:
function test_agg(array)
    res = 0.
    for i = 1:length(array)
        res += array[i]
    end
    res
end

test_agg (generic function with 1 method)

In [44]:
@btime test_agg($my_array)

  1.826 ms (0 allocations: 0 bytes)


499669.7343965981

### Deactivation of Bounds Checks

In [45]:
function test_agg_inbounds(array)
    res = 0.
    for i = 1:length(array)
        @inbounds res += array[i]
    end
    res
end

test_agg_inbounds (generic function with 1 method)

In [46]:
@btime test_agg_inbounds($my_array)

  1.687 ms (0 allocations: 0 bytes)


499669.7343965981

In [52]:
@assert test_agg(my_array) == test_agg_inbounds(my_array)

The *@inbounds* macro disables array boundary checks and gives a speedup of about 8% here (1.826 ms vs. 1.687 ms).

However, be careful:

In [63]:
function test_agg_bugged(array)
    res = 0.
    for i = 1:length(array)+1 # bug: loop should go to length, not length + 1!
        res += array[i]
    end
    res
end

test_agg_bugged (generic function with 1 method)

In [64]:
test_agg_bugged(my_array)

BoundsError: BoundsError: attempt to access 1000000-element Array{Float64,1} at index [1000001]

In [65]:
function test_agg_inbounds_bugged(array)
    res = 0.
    for i = 1:length(array)+1 # bug: loop should go to length, not length + 1!
        @inbounds res += array[i]
    end
    res
end

test_agg_inbounds_bugged (generic function with 1 method)

In [69]:
test_agg_inbounds_bugged(my_array)

499669.7343965981

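The bug in the code is not detected because the *@inbounds* macro removes the bounds check.
The result of accessing an array out of bounds is not predictable: here the result happens to look plausible, but the program could equally return garbage or crash.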
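
One way to reduce this risk is to let the array supply its own index range instead of computing it by hand. The sketch below (hypothetical function name *sum_eachindex*) iterates over *eachindex(array)*, which only yields valid indices, so the off-by-one bug above cannot occur. While debugging, Julia can also be started with the command-line option *--check-bounds=yes*, which enables bounds checks even inside *@inbounds* blocks.

```julia
# Sketch: take the index range from the array itself instead of 1:length(array).
function sum_eachindex(array)
    res = zero(eltype(array))
    for i in eachindex(array)      # yields only valid indices of `array`
        @inbounds res += array[i]
    end
    res
end

values = [1.0, 2.0, 3.5]
println(sum_eachindex(values))
```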

### SIMD

The *@simd* macro makes use of the Single Instruction, Multiple Data (SIMD) functionality of modern CPUs.

It should only be used if the loop iterations are independent and the order of iterations can be changed.

In [76]:
function test_agg_simd(array)
    res = 0.
    @simd for i = 1:length(array)
        @inbounds res += array[i]
    end
    res
end

test_agg_simd (generic function with 1 method)

In [77]:
@btime test_agg_simd($my_array)

  1.227 ms (0 allocations: 0 bytes)


499669.7343966109

In [78]:
test_agg(my_array) - test_agg_simd(my_array)

-1.2747477740049362e-8

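The @simd macro reduces the run time by about a third relative to the baseline (1.826 ms vs. 1.227 ms), but slightly changes the calculation result: floating-point addition is not associative, and @simd allows the compiler to reorder the additions.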
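
That the order of floating-point additions matters can be seen with three numbers. This is a minimal illustration, independent of @simd; the variable names are arbitrary.

```julia
# Floating-point addition is not associative: grouping changes the rounding.
x, y, z = 0.1, 0.2, 0.3

left = (x + y) + z    # add in loop order
right = x + (y + z)   # same terms, different grouping

println(left == right)          # false: the two results differ in the last bit
println(isapprox(left, right))  # true: they agree within rounding error
```

For this reason, results of reordered reductions are better compared with *isapprox* (or the ≈ operator) than with ==.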

In [79]:
@code_llvm test_agg_simd(my_array)


;  @ In[76]:2 within `test_agg_simd'
define double @julia_test_agg_simd_17291(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
;  @ In[76]:3 within `test_agg_simd'
; ┌ @ simdloop.jl:69 within `macro expansion'
; │┌ @ array.jl:200 within `length'
    %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
    %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_array_t addrspace(11)*
    %3 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %2, i64 0, i32 1
    %4 = load i64, i64 addrspace(11)* %3, align 8
; │└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
        %5 = icmp sgt i64 %4, 0
; ││││└└
      %6 = select i1 %5, i64 %4, i64 0
; │└└└
; │ @ simdloop.jl:71 within `macro expansion'
; │┌ @ simdloop.jl:51 within `simd_inner_length'
; ││┌ @ range.jl:541 within `length'
; │││┌ @ checked.jl:

Note the operations on vector data types like *<2 x double>* in the full output (the listing above is truncated).

In [83]:
function test_agg_simd_bad(array)
    res = 0.
    @simd for i = 1:length(array)
        res += array[i]
    end
    res
end

test_agg_simd_bad (generic function with 1 method)

In [84]:
@btime test_agg_simd_bad($my_array)

  1.837 ms (0 allocations: 0 bytes)


499669.7343965981

In [85]:
@code_llvm test_agg_simd_bad(my_array)


;  @ In[83]:2 within `test_agg_simd_bad'
define double @julia_test_agg_simd_bad_17346(%jl_value_t addrspace(10)* nonnull align 16 dereferenceable(40)) {
top:
;  @ In[83]:3 within `test_agg_simd_bad'
; ┌ @ simdloop.jl:69 within `macro expansion'
; │┌ @ array.jl:200 within `length'
    %1 = addrspacecast %jl_value_t addrspace(10)* %0 to %jl_value_t addrspace(11)*
    %2 = bitcast %jl_value_t addrspace(11)* %1 to %jl_array_t addrspace(11)*
    %3 = getelementptr inbounds %jl_array_t, %jl_array_t addrspace(11)* %2, i64 0, i32 1
    %4 = load i64, i64 addrspace(11)* %3, align 8
; │└
; │┌ @ range.jl:5 within `Colon'
; ││┌ @ range.jl:275 within `Type'
; │││┌ @ range.jl:280 within `unitrange_last'
; ││││┌ @ operators.jl:341 within `>='
; │││││┌ @ int.jl:424 within `<='
        %5 = icmp sgt i64 %4, 0
; ││││└└
      %6 = select i1 %5, i64 %4, i64 0
; │└└└
; │ @ simdloop.jl:71 within `macro expansion'
; │┌ @ simdloop.jl:51 within `simd_inner_length'
; ││┌ @ range.jl:541 within `length'
; │││┌ @

Without the @inbounds macro, the array boundary checks prevent the simd optimizations - the benchmark shows no improvement w.r.t. the baseline.