# Linear Algebra, Precision and Profiling

Julia's generic way of implementing algorithms often makes it easy to explore different storage schemes, elevated or reduced precision, or acceleration hardware such as a GPU. I want to present a few illustrative examples based on a real-world iterative algorithm to show how little effort is needed to give these things a try in Julia. In one example I will also show how to track performance by profiling and how to work out what should be done to improve the algorithm at hand.

## Linear Algebra

For dense and sparse arrays, all important linear algebra routines are available in the `LinearAlgebra` standard library (together with `SparseArrays` for the sparse matrix types). This includes common tasks such as
- `qr` (also pivoted)
- `cholesky` (also pivoted)
- `eigen`, `eigvals`, `eigvecs` (compute eigenpairs, values, vectors)
- `factorize` (for computing matrix factorisations)
- `inv` (invert a matrix)

All these methods are implemented generically for any matrix (all subtypes of `AbstractMatrix`) and are additionally specialised for specific kinds of matrices. For example, `factorize` is intended to compute a clever factorisation for solving linear systems. What it does depends on the matrix properties:

In [None]:
using LinearAlgebra
using SparseArrays

In [None]:
# Random real matrix -> will do an LU
A = randn(10, 10)
@show typeof(factorize(A))

# Real-symmetric matrix ->  will do a Bunch-Kaufman
Am = Symmetric(A + A')
@show typeof(factorize(Am))

# Symmetric tridiagonal -> will do a LDLt
Am = SymTridiagonal(A + A')
@show typeof(factorize(Am))

# Random sparse matrix -> will do sparse LU
S = sprandn(50, 50, 0.3)
@show typeof(factorize(S))

# ... and so on ...

They all share a common interface, such that an algorithm like

In [None]:
function solve_many(A, xs)
    F = factorize(A)
    [F \ rhs for rhs in xs]
end

will automatically work for sparse arrays and dense arrays and is furthermore independent of the floating-point type.
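
As a quick check of this claim, here is a minimal sketch that feeds the same `solve_many` a dense `Float64` matrix, a dense `Float32` matrix and a sparse matrix. The test matrices and the diagonal shift (used only to keep them comfortably non-singular) are my own choices for illustration.

```julia
using LinearAlgebra
using SparseArrays

function solve_many(A, xs)
    F = factorize(A)            # factorise once ...
    [F \ rhs for rhs in xs]     # ... and reuse the factors for every right-hand side
end

n = 50
# The diagonal shift is an assumption of this sketch: it keeps the random matrices non-singular
dense64  = randn(n, n) + n * I
dense32  = Float32.(dense64)
sparse64 = sprandn(n, n, 0.1) + n * I

for A in (dense64, dense32, sparse64)
    T = eltype(A)
    rhss = [randn(T, n) for _ in 1:3]
    sols = solve_many(A, rhss)
    worst = maximum(norm(A * x - b) / norm(b) for (x, b) in zip(sols, rhss))
    println(typeof(A), ": worst relative residual = ", worst)
end
```

The residual for the `Float32` case is expected to be larger than for the `Float64` cases, simply because fewer digits are carried.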

##### More details
- https://docs.julialang.org/en/v1/stdlib/LinearAlgebra/

## Use case: A generic Davidson

Let's try this in a more realistic algorithm.
A simple Davidson algorithm can be implemented quite concisely:

In [None]:
using LinearAlgebra

qrortho(X::Array)   = Array(qr(X).Q)
qrortho(X, Y)       = qrortho(X - Y * Y'X)

function rayleigh_ritz(X::Array, AX::Array, N)
    F = eigen(Hermitian(X'AX))
    F.values[1:N], F.vectors[:,1:N]
end

function davidson(A, SS::AbstractArray; tol=1e-5, maxsubspace=8size(SS, 2), verbose=true)
    m = size(SS, 2)
    for i in 1:100
        Ass = A * SS
        rvals, rvecs = rayleigh_ritz(SS, Ass, m)
        Ax = Ass * rvecs

        R = Ax - SS * rvecs * Diagonal(rvals)
        if norm(R) < tol
            return rvals, SS * rvecs
        end

        verbose && println(i, "  ", size(SS, 2), "  ", norm(R))

        # Use QR to orthogonalise the subspace.
        if size(SS, 2) + m > maxsubspace
            SS = qrortho([SS*rvecs R])
        else
            SS = qrortho([SS       R])
        end
    end
    error("not converged.")
end

In [None]:
nev = 2
A = randn(50, 50); A = A + A' + 5I;

# Generate two random orthonormal guess vectors
x0 = qrortho(randn(size(A, 2), nev))

# Run the problem
davidson(A, x0)
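
The core of each iteration is the Rayleigh-Ritz step: project `A` onto the current subspace and diagonalise the small projected matrix. The following stand-alone sketch (with a random subspace of my own choosing, not the one built by `davidson`) shows that the resulting Ritz values approximate the lowest eigenvalues from above, which is why enlarging the subspace can only improve them.

```julia
using LinearAlgebra

n, k = 50, 4
A = randn(n, n); A = A + A' + 5I          # symmetric test matrix, built as in the cell above

X = Matrix(qr(randn(n, k)).Q)             # orthonormal basis of a random k-dimensional subspace
ritz  = eigvals(Hermitian(X' * A * X))    # Ritz values: eigenvalues of the projected k×k matrix
exact = eigvals(Hermitian(A))             # reference: full diagonalisation

# The i-th Ritz value is an upper bound for the i-th lowest exact eigenvalue
for i in 1:k
    println(i, "  ritz = ", ritz[i], "  exact = ", exact[i])
end
```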

In [None]:
# Mixed precision!
# GenericLinearAlgebra provides the pure-Julia eigensolver needed for BigFloat
using GenericLinearAlgebra

λ, v = davidson(Matrix{Float32}(A), Float32.(x0), tol=1e-3)
println()
λ, v = davidson(Matrix{Float64}(A), v, tol=1e-13)
println()
λ, v = davidson(Matrix{BigFloat}(A), v, tol=1e-25)
λ
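
The same "cheap precision first, expensive precision last" pattern is also common for linear systems. As a related sketch (not part of the Davidson code above), here is classic iterative refinement: factorise once in `Float32`, then correct the solution using residuals computed in `Float64`. The function name `refine` and the well-conditioned test matrix are assumptions made for this illustration.

```julia
using LinearAlgebra

# Hypothetical helper: solve A x = b using a Float32 factorisation plus Float64 refinement
function refine(A, b; iters=5)
    F32 = lu(Float32.(A))                    # cheap factorisation in single precision
    x = Float64.(F32 \ Float32.(b))          # low-precision starting solution
    for _ in 1:iters
        r = b - A * x                        # residual evaluated in double precision
        x += Float64.(F32 \ Float32.(r))     # correction obtained from the Float32 factors
    end
    x
end

n = 50
A = randn(n, n) + n * I      # well-conditioned test matrix (assumption of this sketch)
b = randn(n)

x32 = Float64.(lu(Float32.(A)) \ Float32.(b))
println("Float32 only:     ", norm(b - A * x32) / norm(b))
println("after refinement: ", norm(b - A * refine(A, b)) / norm(b))
```

This only works when the matrix is conditioned well enough for the `Float32` factors to be a useful approximation.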

In [None]:
using SparseArrays
nev = 2
spA = sprandn(100, 100, 0.3); spA = spA + spA' + 2I

spA

In [None]:
spx0 = randn(size(spA, 2), nev)
spx0 = Array(qr(spx0).Q)

davidson(spA, spx0, tol=1e-6)

In [None]:
# ... runs with GPUs !
using CUDA

qrortho(X::CuArray) = CuArray(qr(X).Q)
function rayleigh_ritz(X::CuArray, AX::CuArray, N)
    values, vectors = CUDA.CUSOLVER.syevd!('V', 'U', X'AX)
    values[1:N], vectors[:, 1:N]
end

In [None]:
davidson(cu(A), cu(x0))

... but actually the performance is overall not that good out of the box, because we're doing a lot of copying and elementwise access in our naive algorithm, which is especially bad for the GPU version.
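
One common remedy for the copying is to preallocate buffers and use in-place operations such as `mul!` from `LinearAlgebra`. The sketch below only demonstrates the idea on the CPU with array sizes I picked for illustration; it is not a reworked version of `davidson`, and the actual savings on a GPU would have to be measured.

```julia
using LinearAlgebra

n, m = 500, 4
A  = randn(n, n)
SS = randn(n, m)

Ass = similar(SS)        # allocate the result buffer once, outside any loop
mul!(Ass, A, SS)         # writes A * SS into Ass instead of creating a new array

# Compare allocations; both expressions were already run once, so compilation is excluded
A * SS
bytes_alloc   = @allocated A * SS
bytes_inplace = @allocated mul!(Ass, A, SS)
println("A * SS: ", bytes_alloc, " bytes,  mul!: ", bytes_inplace, " bytes")
```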

## Iterative methods

Instead of implementing iterative methods such as the Davidson diagonalisation ourselves, we can also build upon existing packages for standard linear algebra, such as [IterativeSolvers.jl](https://github.com/JuliaLinearAlgebra/IterativeSolvers.jl), [KrylovKit.jl](https://github.com/Jutho/KrylovKit.jl), [Krylov.jl](https://github.com/JuliaSmoothOptimizers/Krylov.jl).

For example instead of hand-coding a Davidson, we could use IterativeSolvers' LOBPCG implementation.

In [None]:
using IterativeSolvers

In [None]:
largest = false
IterativeSolvers.lobpcg(A, largest, x0)

which works seamlessly with GPUs as well:

In [None]:
largest = false
IterativeSolvers.lobpcg(cu(A), largest, cu(x0))

## Profiling and timing measurements

Now how would one go about improving this piece of code?

The best way forward is to obtain an idea of *where* the computational time is spent and then think about where we could *locally* improve. We already saw the `@btime` macro (from [BenchmarkTools.jl](https://github.com/JuliaCI/BenchmarkTools.jl)) for getting accurate timing measurements of individual expressions. Let's see what other options there are.

For our tests we will use this piece of code:

In [None]:
function myfunction(n)
    for i = 1:n
        A = randn(100, 100, 20)
        m = maximum(A)
        Am = mapslices(sum, A; dims=2)
        B = A[:, :, 5]
        Bsort = mapslices(sort, B; dims=1)
        b = rand(100)
        C = B .* b
    end
end

### Profiling

To profile this piece of code we will use Julia's built-in `Profile` package in combination with `ProfileView` as a graphical viewer. Some Julia editors (like VSCode) also have their own plugins to integrate with Julia's profiling capabilities, so it is worth looking out for this in your favourite editor!

In [None]:
using Profile
using ProfileView

In [None]:
# Run once to compile everything ... the profile shown for this run should be ignored
ProfileView.@profview myfunction(1);

In [None]:
ProfileView.@profview myfunction(10);

**Note:** ProfileView does not always work so well with Jupyter notebooks or JupyterLab (but it's great from the REPL). An alternative is ProfileSVG:

In [None]:
using ProfileSVG

In [None]:
ProfileSVG.@profview myfunction(1)
ProfileSVG.@profview myfunction(10)

So how should one interpret this?
- The horizontal direction is the time spent.
- The vertical direction is the depth of the call stack.

What do we learn:
- The `mapslices` calls are clearly the most expensive parts of the function, so these are what we should worry about most
- The first call (`mapslices(sum, A; dims=2)`) is more expensive as it works on more data than `mapslices(sort, B; dims=1)`
- There is a stack of calls to functions in sort.jl on the right. This is because in Julia sorting is implemented recursively (sort functions call themselves)
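
One possible local improvement suggested by these observations: both `sum` and `sort` accept a `dims` keyword directly, so the generic `mapslices` machinery is not needed here. The sketch below only checks that the results agree with the `mapslices` versions used in `myfunction`; whether and by how much it is faster on your machine is best confirmed with `@btime`.

```julia
A = randn(100, 100, 20)
B = A[:, :, 5]

# Reductions and sorting along a dimension without going through `mapslices`
Am_direct    = sum(A; dims=2)
Bsort_direct = sort(B; dims=1)

# Same results as the `mapslices` calls (≈ for the sum, as the summation order may differ)
println(Am_direct ≈ mapslices(sum, A; dims=2))
println(Bsort_direct == mapslices(sort, B; dims=1))
```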

It is worth noting that red is a special colour in these graphs, highlighting a runtime dispatch, which can be an indicator of a type instability (more details and how to cure this in [16_Performance_Engineering.ipynb](16_Performance_Engineering.ipynb)).

For more details, take a look at the [ProfileView](https://github.com/timholy/ProfileView.jl) website.
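
If no graphical viewer is available (e.g. on a remote machine), the `Profile` standard library can also print the collected samples as text. The following is a minimal sketch; `workload` is a hypothetical stand-in modelled on `myfunction`, and the `mincount` threshold is an arbitrary choice to keep the listing short.

```julia
using Profile

function workload(n)   # hypothetical stand-in for `myfunction`
    total = 0.0
    for i = 1:n
        A = randn(100, 100, 20)
        total += sum(mapslices(sum, A; dims=2))
        total += sum(mapslices(sort, A[:, :, 5]; dims=1))
    end
    total
end

workload(1)                 # compile first, so compilation does not dominate the samples
Profile.clear()             # discard samples left over from earlier runs
@profile workload(10)

# Flat listing sorted by sample count; lines with fewer than 10 samples are hidden
Profile.print(format=:flat, sortedby=:count, mincount=10)
```

Using `format=:tree` instead prints the call hierarchy, which corresponds to the vertical direction of the flame graph.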

### High-level timings in TimerOutputs.jl

[TimerOutputs.jl](https://github.com/KristofferC/TimerOutputs.jl) is a great package to get a rough overview of where time is spent. The idea is to annotate the code with simple labels, for which timings are accumulated while the code is running. This does not come for free, but if done at a high level it is cheap enough to be "always on":

In [None]:
using TimerOutputs

const to = TimerOutput()
function annotated_function(n)
    @timeit to "loop" for i = 1:n
        @timeit to "initialisation" A = randn(100,100,20)
        m = maximum(A)
        @timeit to "mapslices on A" Am = mapslices(sum, A; dims=2)
        B = A[:,:,5]
        @timeit to "mapslices on B" Bsort = mapslices(sort, B; dims=1)
        b = rand(100)
        C = B.*b
    end
end

In [None]:
reset_timer!(to)
annotated_function(10)
to
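
To make the underlying idea concrete, here is a rough sketch of labelled timing using only Base functionality. The names `timings`, `timed!` and `annotated_sketch` are hypothetical and this is not how TimerOutputs is implemented; the package additionally tracks call counts, allocations and nesting.

```julia
# Hypothetical minimal stand-in for a labelled timer (accumulated seconds per label)
const timings = Dict{String, Float64}()

function timed!(f, label)
    t0 = time_ns()
    result = f()
    timings[label] = get(timings, label, 0.0) + (time_ns() - t0) / 1e9
    result
end

function annotated_sketch(n)
    for i = 1:n
        A = timed!(() -> randn(100, 100, 20), "initialisation")
        timed!(() -> mapslices(sum, A; dims=2), "mapslices on A")
        B = A[:, :, 5]
        timed!(() -> mapslices(sort, B; dims=1), "mapslices on B")
    end
end

annotated_sketch(10)

# Print the labels, most expensive first
for (label, seconds) in sort(collect(timings); by=last, rev=true)
    println(rpad(label, 18), round(seconds; digits=4), " s")
end
```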

## Changing BLAS/LAPACK implementation

Since linear algebra operations are such basic building blocks across high-performance codes, the basic algorithms from BLAS and LAPACK (such as matrix multiplication, factorisations, solving linear systems, diagonalisation) are available in numerous implementations.

These implementations are sometimes vendor-provided and feature **custom machine-specific** performance optimisations. Sometimes (e.g. on supercomputers) this might even be specific to the exact hardware used in the computer one is currently running on.

By default Julia uses [OpenBLAS](https://github.com/xianyi/OpenBLAS) for all linear algebra operations, which is indeed a good safe default. However, **to get the most** out of the hardware one is running on it is crucial to make use of the best-matching, vendor-specific implementation.

Ideally one would therefore be able to easily switch between BLAS/LAPACK libraries, potentially even depending on the very machine one is running on. The solution in the Julia world to this problem is:

### [Libblastrampoline](https://github.com/JuliaLinearAlgebra/libblastrampoline)

Libblastrampoline is a *BLAS and LAPACK demuxing library*, which allows switching **at runtime** between BLAS/LAPACK implementations. In practice, one uses small wrapper packages like

* [MKL.jl](https://github.com/JuliaLinearAlgebra/MKL.jl)
* [BLISBLAS.jl](https://github.com/carstenbauer/BLISBLAS.jl)

For example if you want to switch to MKL, this is as easy as:

In [None]:
using LinearAlgebra
BLAS.get_config()

In [None]:
BLAS.set_num_threads(4)

In [None]:
using BenchmarkTools
A = rand(100, 100)
B = rand(100, 100)
@btime $A * $B;

In [None]:
using MKL
BLAS.get_config()

In [None]:
BLAS.set_num_threads(4)

In [None]:
@btime $A * $B;
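
To check programmatically which backend is currently active (for instance in a script that should log its setup), the object returned by `BLAS.get_config()` can be inspected. This small sketch only uses the `LinearAlgebra` standard library and therefore reports whatever is loaded at that point: OpenBLAS by default, or MKL if `using MKL` was executed beforehand.

```julia
using LinearAlgebra

config = BLAS.get_config()
for lib in config.loaded_libs
    # Name of the shared library and its integer interface (:lp64 or :ilp64)
    println(basename(lib.libname), "  (", lib.interface, ")")
end
println("BLAS threads: ", BLAS.get_num_threads())
```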

##### More details
- https://docs.julialang.org/en/v1/manual/profile/
- https://github.com/kimikage/ProfileSVG.jl
- https://github.com/JuliaCI/BenchmarkTools.jl
- https://github.com/KristofferC/TimerOutputs.jl

## Takeaways

- Julia has extensive built-in support for standard dense linear algebra (SVD, diagonalisation, linear systems)
- Plenty of packages complement this by iterative methods
- Making use of a matching BLAS/LAPACK library (e.g. MKL) takes very little effort, but can bring noteworthy savings
- Profiling and timing can take place at various levels (from benchmarking individual statements to profiling a whole code)