# Performance engineering and optimisation


In this notebook we first review a few standard pitfalls for performance in Julia and then get our hands dirty in optimising a few pieces of code ourselves.

For more details on the issues mentioned here, see the [performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/) and this [blog article](https://www.stochasticlifestyle.com/7-julia-gotchas-handle/) by Chris Rackauckas.

## Pitfall 1: Global scope

In [None]:
a = 2.0
b = 3.0
function linear_combination()
  2a + b
end
answer = linear_combination()

@show answer;

@code_warntype linear_combination()

Even though all types are known, the compiler does not make use of them. The reason is that in global scope (such as a Jupyter notebook or the REPL) there are no guarantees that `a` and `b` are of a certain type as any later reassignment might change this.

In [None]:
using Traceur
@trace linear_combination()

### Solution 1a: Wrap code in functions

Sounds simple, but this is often a very good solution to this (and many another) performance problem.

In [None]:
function outer()
    a = 2.0
    b = 3.0
    function linear_combination()
      2a + b
    end
    linear_combination() 
end
answer = outer()
@show answer;

@code_warntype outer()

Notice that **constant propagation** is even possible in this case (i.e. Julia will do the computation at compile time):

In [None]:
@code_llvm outer()

All the code has been contracted to a single statement in the LLVM IR: at runtime this function will just return the precomputed result.

**Key message:** Write functions, not scripts!
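If wrapping everything into one outer function is inconvenient, two other common ways to avoid untyped globals are to pass the values as arguments or to declare the globals `const`. The following is only a minimal sketch; the names `linear_combination_args`, `linear_combination_const`, `A_CONST` and `B_CONST` are made up for illustration and do not appear elsewhere in this notebook.

```julia
# Option 1: pass the values as arguments. The compiler specialises
# the function on the concrete argument types at each call.
linear_combination_args(a, b) = 2 * a + b

# Option 2: declare the globals `const`. Their type can then no longer
# change, so the compiler is allowed to rely on it.
const A_CONST = 2.0
const B_CONST = 3.0
linear_combination_const() = 2 * A_CONST + B_CONST

@show linear_combination_args(2.0, 3.0)
@show linear_combination_const();
```

Running `@code_warntype` on both versions should show no red `Any` annotations.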

## Pitfall 2: Type-instabilities

The following function looks innocent ...

In [None]:
function g()
    x = 1
    for i = 1:10
        x = x / 2
    end
    x
end

... but is not:

In [None]:
@code_warntype debuginfo=:none g()

The issue is that the type of the accumulator `x` changes *during the iterations*!

### Solution 2a: Avoid type change

In [None]:
function h()
    x = 1.0
    for i = 1:10
        x = x / 2
    end
    x
end

In [None]:
@code_warntype debuginfo=:none h()

In [None]:
@code_llvm debuginfo=:none h()

(Side note: Things are actually not *too* bad in the unstable version `g`, as `x` can only be an `Int64` or a `Float64`. For such small unions Julia can do a cool thing called *union splitting*, see https://julialang.org/blog/2018/08/union-splitting)
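Type stability can also be checked programmatically instead of reading the `@code_warntype` output by eye. The sketch below uses `@inferred` from the `Test` standard library, which throws an error if the inferred return type does not match the type of the value actually returned. The function `halve` is a hypothetical variant of `h` that takes the start value as an argument and promotes it once before the loop.

```julia
using Test

# Hypothetical variant of `h`: the start value is an argument and is
# converted to a floating-point number once, before the loop starts.
function halve(x0, n)
    x = float(x0)      # x keeps this type for all iterations
    for _ in 1:n
        x = x / 2
    end
    x
end

# Returns the function value if the inferred return type is concrete
# and matches, and throws an error otherwise.
@inferred halve(1, 10)
```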

### Solution 2b: Specify types explicitly

... the Fortran / C way ;)

In [None]:
function g2()
    x::Float64 = 1  # Enforces conversion to Float64
    for i = 1:10
        x = x / 2
    end
    x
end

In [None]:
@code_llvm debuginfo=:none g2()

### Solution 2c: Function barriers

In [None]:
data = Union{Int64,Float64,String}[4, 2.0, "test", 3.2, 1]

In [None]:
function calc_square(x)
  for i in eachindex(x)
    val = x[i]
    val^2
  end
end

In [None]:
@code_warntype calc_square(data)

In [None]:
function calc_square_outer(x)
  for i in eachindex(x)
    calc_square_inner(x[i])
  end
end

calc_square_inner(x) = x^2

In [None]:
@code_warntype calc_square_inner(data[1])
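The idea of a function barrier is that the type uncertainty is confined to the outer function, while the inner function (the *kernel*) is compiled separately for each concrete argument type it is called with. A minimal sketch of the same pattern, where the element type is only decided at runtime (the names `fill_twos!` and `make_twos` are assumptions for this illustration):

```julia
# Kernel: compiled once per concrete type of `a`, so the loop is fast.
function fill_twos!(a)
    for i in eachindex(a)
        a[i] = 2
    end
    a
end

# Outer function: the type of `a` depends on a runtime value, so this
# function is type-unstable on its own.
function make_twos(n, use_float)
    a = use_float ? Vector{Float64}(undef, n) : Vector{Int}(undef, n)
    fill_twos!(a)   # function barrier: one dynamic dispatch happens here
end

@show make_twos(3, true)
@show make_twos(3, false);
```

The cost of the instability is paid once per call of `make_twos`, not once per loop iteration.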

## Pitfall 3: Multiple returns

A common pattern is code such as:

In [None]:
function f(x, flag)
    if flag
        return 1:3
    else
        return [4, 5, 6]
    end
end

However this is not type stable!

In [None]:
@code_warntype f(rand(10), true)

since clearly:

In [None]:
typeof(f(rand(10), true))

In [None]:
typeof(f(rand(10), false))

### Solution 3a: Use a single return

In [None]:
function f1(x, flag)
    result = Vector{Int64}(undef, 3)
    if flag
        result .= 1:3
    else
        result .= [4, 5, 6]
    end
    result
end

@code_warntype f1(rand(10), true)
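Several `return` statements are not a problem in themselves; what matters is that every branch returns the same concrete type. As an alternative sketch (the name `f_collect` is hypothetical), the range can be converted to a `Vector` so that both branches agree:

```julia
# Both branches now yield a Vector{Int}, so the return type is concrete.
function f_collect(x, flag)
    flag ? collect(1:3) : [4, 5, 6]
end

@show typeof(f_collect(rand(10), true))
@show typeof(f_collect(rand(10), false));
```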

## Pitfall 4: Views and copies

By default, slicing into a matrix returns a copy and not a view.

In [None]:
using BenchmarkTools, LinearAlgebra

M = rand(3, 3);
x = rand(3);

In [None]:
f(x, M) = dot(M[1:3, 1], x)       # Implicit copy
@btime f($x, $M);                 # ($ syntax in BenchmarkTools to avoid global scope
                                  #  ... otherwise numbers could be less meaningful.)

In [None]:
g(x, M) = dot(view(M, 1:3, 1), x)  # Avoids the copy
@btime g($x, $M);

In [None]:
g(x, M) = @views dot(M[1:3, 1], x)  # More convenient
@btime g($x, $M);
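Keep in mind that a view shares memory with its parent array, so writing to a view changes the original matrix, whereas writing to a slice only changes the copy. A small sketch with a hypothetical matrix `Q`:

```julia
Q = [1 2; 3 4]        # small example matrix (hypothetical)

s = Q[:, 1]           # slice: allocates a new, independent vector
v = @view Q[:, 1]     # view: refers to the memory of Q

s[1] = 10             # changes only the copy
@show Q[1, 1]         # still 1

v[1] = 20             # writes through to Q
@show Q[1, 1];        # now 20
```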

## Pitfall 5: Temporary allocations and vectorised code

In [None]:
using BenchmarkTools

function f()
    x = [1; 5; 6]  # Column-vector
    for i in 1:100_000
       x = x + 2*x
    end
    x
end

@btime f();

### Solution 5a: Use dot syntax!

The broadcasting dot syntax (`.`) we already talked about is more than a convenience: it guarantees loop fusion (see the blog post by Steven G. Johnson: https://julialang.org/blog/2017/01/moredots), which avoids temporary arrays and thus speeds up computations.

In [None]:
function f1()
    x = [1; 5; 6]
    for i in 1:100_000
        x .= x .+ 2 .* x
        # @. x = x + 2*x   # equivalent
    end
    x
end
@btime f1();

Notice the roughly 10-fold speedup (the exact factor depends on your machine)!

### Solution 5b: Use immutable datatypes

In an attempt to go faster one might be tempted to write the loop-fusion explicitly (and use `@inbounds`):

In [None]:
function f2()
    x = [1; 5; 6]
    @inbounds for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    x
end
@btime f2();

This helps, but even faster is to let the compiler know about the size directly:

In [None]:
using StaticArrays

function f3()
  x = @SVector [1; 5; 6]
  for i in 1:100_000
    x = @. x + 2*x
  end
  x
end

@btime f3();
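The same idea can be tried without any extra package, since plain tuples are also immutable and carry their length in their type. The following sketch (the name `f_tuple` and its argument `n` are assumptions made for this illustration) uses a tuple in place of the `SVector`; how it compares to `f3` on your machine is best checked with `@btime`.

```julia
# Tuples are immutable and their length is part of the type, so the
# compiler knows the size of `x` at compile time.
function f_tuple(n)
    x = (1, 5, 6)
    for i in 1:n
        x = x .+ 2 .* x   # broadcasting over tuples returns a new tuple
    end
    x
end

@show f_tuple(2);
```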

## Pitfall 6: Abstract fields

(See also the project on [custom types](12_Types_Specialisation.ipynb)).

In [None]:
using BenchmarkTools

In [None]:
struct MyType
    x::AbstractFloat
    y::AbstractString
end

f(a::MyType) = a.x^2 + sqrt(a.x)

In [None]:
a = MyType(3.0, "test")

@btime f($a);

### Solution 6a: Use concrete types in structs

In [None]:
struct MyTypeConcrete
    x::Float64
    y::String
end

f(b::MyTypeConcrete) = b.x^2 + sqrt(b.x)

In [None]:
b = MyTypeConcrete(3.0, "test")

@btime f($b);

Note that the latter implementation is **more than 30x faster**!

### Solution 6b: If generic content is needed

Use [parametric types](12_Types_Specialisation.ipynb).

In [None]:
struct MyTypeParametric{A<:AbstractFloat, B<:AbstractString}
    x::A
    y::B
end

f(c::MyTypeParametric) = c.x^2 + sqrt(c.x)

In [None]:
c = MyTypeParametric(3.0, "test")

While this makes the code a little less readable (field types and stack traces are now less meaningful),
the compiler is able to produce optimal code, since the types of `x` and `y` are encoded in the type of the struct:

In [None]:
@btime f($c);

In [None]:
c = MyTypeParametric(Float32(3.0), SubString("test"))

In [None]:
@btime f($c);
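Whether the fields of a struct are concrete can be inspected directly with `fieldtype` and `isconcretetype`. The two structs below are hypothetical stand-ins for `MyType` and `MyTypeParametric`, reduced to one field to keep the sketch short:

```julia
# Hypothetical minimal structs for illustration only.
struct FieldAbstract
    x::AbstractFloat
end

struct FieldParam{T<:AbstractFloat}
    x::T
end

# The field type of the first struct is abstract ...
@show isconcretetype(fieldtype(FieldAbstract, :x))

# ... whereas for the parametric struct it is fixed by the type parameter.
@show isconcretetype(fieldtype(FieldParam{Float64}, :x))
@show fieldtype(typeof(FieldParam(1.0f0)), :x);
```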

## Pitfall 7: Column major order

Unlike C or NumPy (but like MATLAB and Fortran), Julia uses column-major ordering for matrices:

In [None]:
M = reshape(1:9, 3, 3)

In [None]:
@show M[1, 2] M[2, 2] M[3, 2]

i.e. **earlier indices run faster**!

Neglecting this leads to a performance penalty:

In [None]:
M = rand(1000, 1000);

function fcol(M)
    for col in 1:size(M, 2)
        for row in 1:size(M, 1)
            M[row, col] = 42
        end
    end
    nothing
end

function frow(M)
    for row in 1:size(M, 1)
        for col in 1:size(M, 2)
            M[row, col] = 42
        end
    end
    nothing
end

In [None]:
@btime fcol($M)

In [None]:
@btime frow($M)
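Column-major order also determines how linear indices map to matrix entries: a single index walks down the first column, then the second, and so on. A sketch (the names `Mdemo` and `fill_eachindex!` are made up here) that shows this and uses `eachindex` to let Julia pick an efficient iteration order for the array at hand:

```julia
Mdemo = reshape(1:9, 3, 3)

# Linear index 4 is the first entry of the second column.
@show Mdemo[4] == Mdemo[1, 2]

# `eachindex` iterates over all entries in an order suited to the array type.
function fill_eachindex!(M, val)
    for i in eachindex(M)
        M[i] = val
    end
    M
end

@show fill_eachindex!(zeros(2, 3), 42);
```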

## Pitfall 8: Allocating linear algebra operations

By default Julia tries to be safe, which means that many linear algebra operations allocate memory for their result. E.g. consider the following matrix-matrix products:

In [None]:
function f()
    A = rand(100, 100)
    B = rand(100, 100)
    s = 0.0
    for i in 1:1000
        C = A * B
        s += C[i]
    end
    s
end

@btime f();

### Solution 8a: Preallocate memory and use in-place functions

In [None]:
using LinearAlgebra

function f1()
    A = rand(100, 100)
    B = rand(100, 100)
    C = zeros(100, 100)  # Preallocate
    s = 0.0
    for i in 1:1000
        mul!(C, A, B)    # In-place matmul
        s += C[i]
    end
    s
end

@btime f1();
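`mul!` also has a five-argument form, `mul!(C, A, B, α, β)`, which overwrites `C` with `α*A*B + β*C`, so an update of an accumulator does not need a temporary either. A small sketch with hypothetical 2×2 matrices `A2`, `B2`, `C2`:

```julia
using LinearAlgebra

A2 = [1.0 2.0; 3.0 4.0]
B2 = [0.0 1.0; 1.0 0.0]
C2 = ones(2, 2)
C2_old = copy(C2)              # kept only to verify the result below

mul!(C2, A2, B2, 2.0, 0.5)     # C2 = 2.0 * A2 * B2 + 0.5 * C2, in place

@show C2 ≈ 2.0 * A2 * B2 + 0.5 * C2_old;
```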

## Performance takeaways

* Gotcha 1: **Wrap code in self-contained functions** in performance critical applications, i.e. avoid global scope.
* Gotcha 2: Write **type-stable code** (check with `@code_warntype`).
* Gotcha 3: Use **views** instead of copies to avoid unnecessary allocations.
* Gotcha 4: Use **broadcasting (more dots)** to avoid temporary allocations in vectorised code (or write out loops).
* Gotcha 5: **Types should always have concrete fields.** If you don't know them in advance, use type parameters.
* Gotcha 6: Be aware of **column major order** when looping over arrays.


##### More details
-  Check out [this MIT lecture](https://mitmath.github.io/18337/lecture2/optimizing).

## Extra performance tips

Compared to Python and C, Julia puts a much stronger emphasis on functional programming,
which often allows one to write concise code that *avoids allocations*. For example:

In [None]:
using BenchmarkTools
function myfun_naive(x)
    x_mod = @. abs(2x - x)
    minimum(x_mod)
end

x = randn(10_000)
@btime myfun_naive($x);

Now, `minimum` also accepts a function as its first argument. This function is applied *elementwise* before doing the standard action of `minimum` (taking the minimum), so no intermediate array is needed:

In [None]:
function myfun_fast(x)
    minimum(xi -> abs(2xi - xi), x)
end
@btime myfun_fast($x);

A convenience syntax makes this nicer to write for more complicated expressions:

In [None]:
function myfun_fast(x)
    minimum(x) do xi
        abs(2xi - xi)
    end
end

This is equivalent to the first definition of `myfun_fast`. Notice how the first (function) argument of `minimum` disappeared and is replaced by a `do ... end` block, which defines the function to be used as the first argument.

`minimum` is by no means special here. This syntax is general and works for *all* functions that take a function as their first argument, such as `map`, `filter`, `sum`, `minimum`, `maximum` ...
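A few quick examples of the same `do` syntax with other functions from `Base` (the vector `xs` is just a made-up input):

```julia
xs = [1, 2, 3, 4]

# sum of squares without building an intermediate array
sum_of_squares = sum(xs) do xi
    xi^2
end

# keep only the even entries
evens = filter(xs) do xi
    iseven(xi)
end

# apply a function to every entry
doubled = map(xs) do xi
    2xi
end

@show sum_of_squares evens doubled;
```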

As usual, custom functions in Julia are no different here:

In [None]:
function print_every_second(f, x)
    for i in 1:2:length(x)
        println(f(x[i]))
    end
end

x = [1, 2, 3, 4, 5 , 6]
print_every_second(x) do xi
    2xi
end

## Performance annotations

Julia features a number of annotation macros that disable costly checks or that enable more aggressive compiler optimisations. **These macros can lead to segfaults or code that produces wrong answers**, so they should **only be used locally** (i.e. only in well-localised pieces of code) and only if you know it has **no influence on the correctness** of a particular code segment.

### `@inbounds`

Disables checks for out-of-bounds access in arrays.

In [None]:
function f()
    x = [1; 5; 6]
    for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    x
end
@btime f();

In [None]:
function finbounds()
    x = [1; 5; 6]
    @inbounds for i in 1:100_000    
        for k in 1:3
            x[k] = x[k] + 2*x[k]
        end
    end
    x
end
@btime finbounds();
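`@inbounds` is only safe if every index is guaranteed to be valid. One way to make this easier to see is to derive the loop range from the array itself with `eachindex` instead of hard-coding it. A minimal sketch (the function name `triple_all!` is hypothetical):

```julia
# The indices come from `x` itself, so they are valid by construction
# and skipping the bounds checks cannot cause an out-of-bounds access.
function triple_all!(x)
    @inbounds for k in eachindex(x)
        x[k] = x[k] + 2 * x[k]
    end
    x
end

@show triple_all!([1, 5, 6]);
```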

### `@simd`

Enables SIMD (single-instruction-multiple-data, vectorisation) instructions that are potentially **unsafe**. Most importantly this allows the compiler to *reorder* or *fuse* loop iterations.

In [None]:
function sum_elements(x)
    s = zero(eltype(x))
    for xi in x
        s += xi
    end
    s
end

x = rand(1000);
@btime sum_elements($x);

In [None]:
function sum_elements_simd(x)
    s = zero(eltype(x))
    @simd for xi in x
        s += xi
    end
    s
end

@btime sum_elements_simd($x);

In [None]:
sum_elements(x) ≈ sum_elements_simd(x)

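*Note*: For integer inputs both functions will have essentially the same runtime. Integer addition is associative even in finite precision, so the compiler may reorder and vectorise the loop without `@simd`. Floating-point addition is not associative, so reordering can change the result and needs the explicit permission that `@simd` gives.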
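The difference in associativity is easy to verify directly. The sketch below compares two groupings of the same sum, once for `Float64` values and once for `Int` values (including wrap-around on overflow); the variable names are made up for this illustration.

```julia
# Floating point: the grouping changes the rounding and hence the result.
lhs = (0.1 + 0.2) + 0.3
rhs = 0.1 + (0.2 + 0.3)
@show lhs rhs
@show lhs == rhs

# Integers: the grouping never matters, even if an intermediate overflows.
m = typemax(Int)
@show (m + 1) - 1 == m + (1 - 1);
```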

### `@fastmath`

Julia also has a macro to locally turn on the infamous fast math optimisations. The very point of fast math is to **trade accuracy for speed**, so you should generally expect less accurate results (to varying extents). Beyond mentioning that it exists I will not go into the details here. For a good blog article highlighting the issues and opportunities of `@fastmath` see https://simonbyrne.github.io/notes/fastmath/.

## Optimisation project 1

Optimise the following code.

(The type and size of the input is fixed/may not be changed.)

In [None]:
function work!(A, N)
    D = zeros(N, N)
    for i in 1:N
        D = b[i] * c * A
        b[i] = sum(D)
    end
    b
end

N = 100
A = rand(N,N)
b = rand(N)
c = 1.23

work!(A,N)

In [None]:
using BenchmarkTools
@btime work!($A, $N);

## Optimisation project 2

Optimise the following function.

In [None]:
function work!(A, B, v, N)
    val = 0
    for i in 1:N
        for j in 1:N
            val = mod(v[i], 256)
            A[i, j] = B[i, j] * (sin(val) * sin(val) - cos(val) * cos(val))
        end
    end
    val
end

The (fixed) input is given by:

In [None]:
N = 4000
A = zeros(N, N)
B = rand(N, N)
v = rand(Int, N);

work!(A, B, v, N)

You can benchmark with the following code snippet. The larger the Mega-iterations per second the better!

In [None]:
using BenchmarkTools
runtime = @belapsed work!($A, $B, $v, $N);
perf = N * N * 1e-6 / runtime # MIt/s
println("Performance: $perf MIt/s")