# Introduction

Julia has a number of macros that can improve performance, sometimes considerably.

# Requirements

In [1]:
using BenchmarkTools

# Array views

Unlike NumPy, where basic slicing returns a view, slicing an array in Julia creates a copy.  Since copying has an impact on performance, Julia provides the `@view` macro to indicate that you want a view rather than a copy.
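
The difference between a copy and a view can be seen by writing to each of them.  The following is a minimal sketch; the variable names `col_copy` and `col_view` are chosen here for illustration only.

```julia
A = [1 2 3; 4 5 6]

# Slicing with a colon allocates a new array (a copy of the column).
col_copy = A[:, 2]
col_copy[1] = 20      # modifies only the copy
println(A[1, 2])      # still 2

# @view creates a SubArray that refers to the memory of A.
col_view = @view A[:, 2]
col_view[1] = 20      # writes through to A
println(A[1, 2])      # now 20
```

Because a view shares memory with its parent array, it should only be handed to code that is not expected to modify it, unless modifying the parent is what you intend.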

To illustrate, consider a function to compute the sum of the elements of an array.

In [2]:
function sum_vector(x)
    total = 0
    for i in 1:length(x)
        total += x[i]
    end
    return total
end

sum_vector (generic function with 1 method)

We can use this function to compute the column-wise sum of a two-dimensional array.

In [3]:
function sum_columns(A)
    total = zeros(size(A, 2))
    for i in 1:size(A, 2)
        total[i] = sum_vector(A[:, i])
    end
    return total
end

sum_columns (generic function with 1 method)

The second version of this function differs only in its use of the `@view` macro.

In [4]:
function sum_columns_view(A)
    total = zeros(size(A, 2))
    for i in 1:size(A, 2)
        total[i] = sum_vector(@view(A[:, i]))
    end
    return total
end

sum_columns_view (generic function with 1 method)

Let's verify the implementation.

In [5]:
A = [ 1 2 3; 4 5 6 ]

2×3 Matrix{Int64}:
 1  2  3
 4  5  6

In [6]:
sum_columns(A)

3-element Vector{Float64}:
 5.0
 7.0
 9.0
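
Column sums computed from views can be cross-checked in the same spirit.  The sketch below is self-contained and does not call the functions defined above; it compares sums taken over views against the result of `sum` from Base, and the names `expected` and `from_views` are illustrative.

```julia
A = [1 2 3; 4 5 6]

# Reference result from Base: sum along the first dimension, flattened to a vector.
expected = vec(sum(A, dims=1))

# Column sums computed from views, one view per column.
from_views = [sum(@view A[:, j]) for j in axes(A, 2)]

println(from_views == expected)   # true
```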

Now we can benchmark the two implementations.

In [7]:
A = rand(1000, 1000);

In [8]:
@benchmark sum_columns(A)

BenchmarkTools.Trial: 1201 samples with 1 evaluations.
 Range (min … max):  3.502 ms … 6.916 ms    ┊ GC (min … max): 0.00% … 19.50%
 Time  (median):     3.771 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.121 ms ± 704.506 μs  ┊ GC (mean ± σ):  5.66% ± 9.43%

In [9]:
@benchmark sum_columns_view(A)

BenchmarkTools.Trial: 1932 samples with 1 evaluations.
 Range (min … max):  2.392 ms … 3.860 ms    ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.564 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.558 ms ± 109.650 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

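The difference is significant: the median time drops from 3.771 ms to 2.564 ms, and the mean share of time spent in garbage collection drops from 5.66% to 0.00%.  By using array views, we avoid allocating a copy of each column, which saves on memory allocations and garbage collection.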
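
The saving in allocations can also be observed directly with the `@allocated` macro from Base.  The following is a rough sketch, not a replacement for the benchmarks above; the helper functions `copy_sum` and `view_sum` are hypothetical and exist only for this illustration.

```julia
# Hypothetical helpers, written only for this illustration.
copy_sum(A) = sum(sum(A[:, j]) for j in axes(A, 2))         # each slice allocates a copy
view_sum(A) = sum(sum(@view A[:, j]) for j in axes(A, 2))   # views reuse the memory of A

A = rand(1000, 1000)

# Call both functions once so that compilation is not part of the measurement.
copy_sum(A); view_sum(A)

bytes_copy = @allocated copy_sum(A)
bytes_view = @allocated view_sum(A)

println("slices: ", bytes_copy, " bytes")
println("views:  ", bytes_view, " bytes")
```

When a function contains many slicing expressions, the `@views` macro can be applied to a whole expression or block so that every slice in it becomes a view.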

# Array bounds checking

Although it is of course useful to have array bounds checked at runtime, this does generate some overhead.
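
As a reminder of what the check provides, the minimal sketch below accesses an index that does not exist.  With bounds checking enabled, which is the default, Julia raises a `BoundsError` rather than reading memory outside the array.

```julia
x = [1.0, 2.0, 3.0]

# An invalid index raises an error instead of reading memory
# that does not belong to the array.
try
    x[4]
catch err
    println(err isa BoundsError)   # true
end
```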

Consider the following function, which implements a daxpy-like vector triad that updates `x` in place.

In [10]:
function vector_triad(x, y, α)
    for i in 1:length(x)
        x[i] += α*x[i] + y[i]
    end
end

vector_triad (generic function with 1 method)

The second version of this function uses the `@inbounds` macro to tell the compiler that you, the programmer, are sure that every array access in the loop is within the bounds of the arrays.  Hence the compiler will not generate code to check the bounds.

In [11]:
function vector_triad_no_check(x, y, α)
    @inbounds for i in 1:length(x)
        x[i] += α*x[i] + y[i]
    end
end

vector_triad_no_check (generic function with 1 method)

Benchmarking this illustrates that, indeed, runtime array bounds checking generates significant overhead.

In [12]:
x = rand(100_000); y = rand(100_000); α = 3.1;

In [13]:
@benchmark vector_triad(x, y, α)

BenchmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  81.701 μs … 1.673 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     82.102 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   86.846 μs ± 25.627 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [14]:
@benchmark vector_triad_no_check(x, y, α)

BenchmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  56.901 μs … 306.705 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.001 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.483 μs ± 17.726 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

Using the `@inbounds` macro reduces the median execution time from 82.102 μs to 58.001 μs.  However, it is important to realize that if you misjudge and an array is accessed out of bounds, your code may silently produce incorrect results or corrupt memory; in the best case, it will crash.
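
One way to reduce that risk is to validate the arrays once, before the loop, and to iterate over indices that are valid by construction.  The sketch below is one possible pattern under that assumption; the function name `vector_triad_checked_once` is hypothetical and not part of the code above.

```julia
# Hypothetical variant of the triad, written for illustration only.
function vector_triad_checked_once(x, y, α)
    # Validate once, outside the loop, so that the promise made by @inbounds holds.
    axes(x) == axes(y) || throw(DimensionMismatch("x and y must have the same indices"))
    # eachindex(x, y) yields only indices that are valid for both arrays.
    @inbounds for i in eachindex(x, y)
        x[i] += α*x[i] + y[i]
    end
    return x
end

x = [1.0, 2.0]; y = [10.0, 20.0]
vector_triad_checked_once(x, y, 2.0)
println(x)   # [13.0, 26.0]
```

The single check outside the loop costs far less than a check on every access, and a mismatch in array sizes is reported as an error instead of leading to an out-of-bounds access.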

# Single Instruction, Multiple Data (SIMD)

Julia has a macro, `@simd`, that gives the compiler permission to reorder loop iterations so that it can generate vector instructions.  In the implementation of the vector triad below, we add the `@simd` macro (note that the `@inbounds` macro is used as well).

In [15]:
function vector_triad_simd(x, y, α)
    @inbounds @simd for i in 1:length(x)
        x[i] += α*x[i] + y[i]
    end
end

vector_triad_simd (generic function with 1 method)
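
The freedom that `@simd` grants is most visible in reductions, where the compiler may reorder floating-point additions into several partial sums.  The following sketch uses a hypothetical function `sum_simd`, written only for this illustration, and compares its result with `sum` from Base using a tolerance.

```julia
# Hypothetical reduction, written for illustration only.
function sum_simd(x)
    total = zero(eltype(x))
    # @simd allows the compiler to reorder the additions into partial sums.
    @inbounds @simd for i in eachindex(x)
        total += x[i]
    end
    return total
end

x = rand(100_000)

# The result may differ from a strictly sequential sum in the last few bits,
# so compare with a tolerance rather than with ==.
println(sum_simd(x) ≈ sum(x))
```

In `vector_triad_simd` above, each iteration is independent of the others and there is no reduction variable, so `@simd` does not change the result there.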

Benchmarking `vector_triad_simd` against `vector_triad_no_check` illustrates that the use of vector instructions can significantly improve performance.

In [16]:
x = rand(100_000); y = rand(100_000); α = 3.1;

In [17]:
@benchmark vector_triad_no_check(x, y, α)

BenchmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  57.601 μs … 1.668 ms   ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     58.301 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   61.959 μs ± 29.384 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [18]:
@benchmark vector_triad_simd(x, y, α)

BenchmarkTools.Trial: 10000 samples with 1 evaluations.
 Range (min … max):  25.100 μs … 203.703 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     25.700 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   27.522 μs ± 12.962 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%