This week's ICME LA/Opt Seminar will be given by one of the co-creators of Julia, Viral Shah.

* Title: On Machine Learning and Programming Languages
* Time: 4:30pm Thursday Feb 15, 2018
* Place: Y2E2, Room 101

# Performance

Today we'll talk a bit about performance, in particular how to write Julia code with performance in mind, and how to measure performance.  You can find a lot of useful material in Julia's documentation: [FAQ](https://docs.julialang.org/en/stable/manual/faq/), [performance tips](https://docs.julialang.org/en/stable/manual/performance-tips/), [profiling](https://docs.julialang.org/en/stable/manual/profile/).

We won't talk too much about performance relative to other languages (see Julia's figure [here](http://julialang.org/benchmarks/), and look around the internet for criticisms), and mostly concern ourselves with writing fast code within Julia.  However, many of these topics apply directly or indirectly to many languages used for scientific computing, so keep an eye out for questions or connections to languages you are familiar with.

A few things that we'll pay attention to today are speed and memory allocation.  A few general heuristics are:
* Preallocation is good (don't grow arrays dynamically if avoidable)
* Type annotations are good (tell the compiler which types you want to instantiate)
* Avoid changing the type of variables
* Write multiple function methods instead of multiple code paths in a function
* Use for-loops over vectorized notation (we saw this last time with array broadcasts)
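
Two of these heuristics (preallocation, and methods instead of code paths) can be sketched in a few lines.  This is only a minimal illustration, assuming Julia 1.x syntax; the names `squares_grow`, `squares_prealloc`, `describe_branchy`, and `describe` are made up for this example.

```julia
# Growing an array dynamically: the storage may be reallocated as it grows.
function squares_grow(n)
    out = Float64[]
    for i in 1:n
        push!(out, i^2)
    end
    return out
end

# Preallocating: the storage is allocated once, then filled by index.
function squares_prealloc(n)
    out = Vector{Float64}(undef, n)
    for i in 1:n
        out[i] = i^2
    end
    return out
end

# Multiple code paths inside one function...
function describe_branchy(x)
    if isa(x, Integer)
        return "integer"
    elseif isa(x, AbstractFloat)
        return "float"
    else
        return "other"
    end
end

# ...versus one method per case, selected by dispatch.
describe(x::Integer) = "integer"
describe(x::AbstractFloat) = "float"
describe(x) = "other"

println(squares_grow(5) == squares_prealloc(5))   # true
println(describe(1), " ", describe(1.0), " ", describe("a"))
```

Both pairs compute the same thing; the second member of each pair simply gives the compiler more to work with.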

It is worth mentioning that if you don't want to worry about this sort of thing, that's OK.  One of the nice things about Julia is that you can use it at a high level without getting bogged down in this sort of analysis.  However, if you use certain functions a lot, plan on having others use your functions a lot, or want your simulations to finish faster, it may be worth taking a second pass at your code to find optimizations.  With a bit of practice, you will also be able to write fast code the first time around.

## Notes on Timing

* Remember that the first time you call a function it is compiled "just in time", so the first timing includes compilation.  Run the function a second time to get a better idea of how fast it actually is.

* The actual amount of time it takes to run a function depends on how fast you are able to schedule a function call on your machine.  Remember, you're not just running Julia on your computer, but also an operating system, and perhaps a variety of other applications, all sharing your processor's time and attention.  Usually the best way to make this effect negligible is to amortize it over many function runs by calling a function many times in succession.

* Your processor will have a big impact on performance.  Clock speed is the most obvious relevant variable, but architecture and compiler support for your architecture also matter (in this case LLVM support).
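
The first two notes can be combined into a small timing helper.  This is a rough sketch, assuming Julia 1.x; `seconds_per_call` is a hypothetical name, not a built-in function.

```julia
# Hypothetical helper: average wall-clock seconds per call of f(args...).
function seconds_per_call(f, args...; reps = 100)
    f(args...)                       # warm-up call, so JIT compilation is not timed
    t = @elapsed for _ in 1:reps     # amortize scheduling noise over many calls
        f(args...)
    end
    return t / reps
end

v = randn(10_000)
println("sum(v):  ", seconds_per_call(sum, v), " seconds per call")
println("sort(v): ", seconds_per_call(sort, v; reps = 20), " seconds per call")
```

The numbers will vary from run to run and from machine to machine.  The BenchmarkTools.jl package automates this kind of repeated measurement more carefully, if you need something more robust than this sketch.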

## Example: The Inner Product 

You've probably already used the `@time` macro, which is a very easy way to get an idea of speed and memory allocation.

In [11]:
function bad_innerprod(x, y)
    @assert length(x) == length(y)
    ans = 0
    for i = 1:length(x)
        ans += x[i] * y[i]
    end
    return ans
end
;

In [2]:
n = 100000
x = randn(n)
y = randn(n)
;

In [13]:
@time bad_innerprod(x, y)

  0.014963 seconds (300.00 k allocations: 4.578 MiB, 58.08% gc time)


-13.138399055370671

As you can see, the `@time` macro reports both timing and memory allocation information.  The function above allocates several megabytes of memory (4.578 MiB across 300,000 allocations) just to do some simple arithmetic.  This may not seem like a big deal, but imagine if this function were a very small component of a much larger program and were called thousands of times.  Here's a better example:

In [5]:
function better_innerprod0(x::Array{T}, y::Array{T}) where {T}
    @assert length(x) == length(y)
    ans = zero(T)
    for i = 1:length(x)
        ans += x[i] * y[i]
    end
    return ans
end
;

In [7]:
@time better_innerprod0(x,y)

  0.000214 seconds (5 allocations: 176 bytes)


-13.138399055370671

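Essentially, just by giving the compiler better type information, we kept it from allocating unnecessary memory.  The change that matters most is `ans = zero(T)`: in `bad_innerprod`, `ans` starts as the integer `0` and becomes a `Float64` on the first iteration, so its type changes inside the loop.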

In [14]:
function better_innerprod1(x::Array{T}, y::Array{T}) where {T}
    @assert length(x) == length(y)
    ans = zero(T)
    for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    return ans
end
;

In [17]:
@time better_innerprod1(x, y)

  0.000194 seconds (5 allocations: 176 bytes)


-13.138399055370671

The `@inbounds` macro tells the compiler that it does not need to check whether each index refers to a memory location that is actually part of the array.  The compiler may be able to infer this in this particular example, but if you have more complicated loops, the macro may give you a noticeable speedup.  Only use it when you are sure every index is valid, since an out-of-bounds access will no longer be caught.

Now we separate the inner loop from the length check.  If you have more complex logic, breaking your functions into smaller components can speed up evaluation.

In [18]:
function fast_innerprod(x::Array{T}, y::Array{T}) where {T}
    ans::T = 0
    for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod2(x::Array{T}, y::Array{T}) where {T}
    @assert length(x) == length(y)
    fast_innerprod(x, y)
end
;

In [21]:
@time fast_innerprod(x, y)
@time better_innerprod2(x, y)

  0.000290 seconds (5 allocations: 176 bytes)
  0.000243 seconds (5 allocations: 176 bytes)


-13.138399055370671

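The `@simd` macro ("Single Instruction, Multiple Data") can be used on loops that can be vectorized.  This means no `break`s or `continue`s, and that one iteration should not depend on the result of a previous iteration (apart from a reduction variable such as `ans` below, which `@simd` is allowed to accumulate in a different order).  See more [here](https://en.wikipedia.org/wiki/SIMD).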

In [22]:
function fast_innerprod2(x::Array{T}, y::Array{T}) where {T}
    ans::T = 0
    @simd for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod3(x::Array{T}, y::Array{T}) where {T}
    @assert length(x) == length(y)
    fast_innerprod2(x, y)
end
;

In [25]:
@time fast_innerprod2(x, y)
@time better_innerprod3(x, y)

  0.000282 seconds (5 allocations: 176 bytes)
  0.000210 seconds (5 allocations: 176 bytes)


-13.138399055370627
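
The last digits of this result differ from the earlier versions (`...0627` versus `...0671`).  A likely explanation is that `@simd` lets the compiler accumulate the sum in a different order, and floating-point addition is not associative.  Here is a small self-contained illustration with plain `Float64` values (not the vectors above):

```julia
a, b, c = 0.1, 0.2, 0.3

left  = (a + b) + c     # add from left to right
right = a + (b + c)     # same terms, different grouping

println(left)                  # 0.6000000000000001
println(right)                 # 0.6
println(left == right)         # false
println(isapprox(left, right)) # true: they agree up to rounding error
```

Neither answer is wrong; both are within rounding error of the exact sum.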

You can also use the equivalent of the `-ffast-math` compiler optimization flag with `@fastmath`.  Note that this may change the accuracy of your results, or give you an answer that is entirely wrong if you aren't careful.

In [26]:
function fast_innerprod3(x::Array{T}, y::Array{T}) where {T}
    ans::T = 0
    @fastmath @simd for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod4(x::Array{T}, y::Array{T}) where {T}
    @assert length(x) == length(y)
    fast_innerprod3(x, y)
end
;

In [28]:
@time fast_innerprod3(x, y)
@time better_innerprod4(x, y)

  0.000195 seconds (5 allocations: 176 bytes)
  0.000194 seconds (5 allocations: 176 bytes)


-13.138399055370627

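Here's how you would compute an inner product with Julia's built-in `dot` (note that this calls BLAS).  On Julia 0.7 and later, `dot` lives in the `LinearAlgebra` standard library, so run `using LinearAlgebra` first.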

In [30]:
@time x' * y 
@time dot(x,y)



  0.000235 seconds (5 allocations: 176 bytes)
  0.000210 seconds (5 allocations: 176 bytes)


-13.138399055370783

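Now, let's compare the efficiency of each implementation.  The `@elapsed` macro returns the number of seconds taken to evaluate an expression, here a for-loop.  An inner product of length `n` costs about `2n` floating-point operations (one multiply and one add per element), so each rate below is `2*n*reps` divided by the elapsed time, scaled to GFlop per second.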

In [31]:
function gflop_innerprod(n, reps )
    x = randn(n)
    y = randn(n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s+=bad_innerprod(x,y)
    end
    println("GFlop (bad_innerprod)      = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod0(x,y)
    end
    println("GFlop (better_innerprod0)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod1(x,y)
    end
    println("GFlop (better_innerprod1)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod2(x,y)
    end
    println("GFlop (better_innerprod2)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod3(x,y)
    end
    println("GFlop (better_innerprod3)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod4(x,y)
    end
    println("GFlop (better_innerprod4)  = ",2.0*n*reps/time*1E-9)  
    time = @elapsed for j in 1:reps
        s+=dot(x,y)
    end
    println("GFlop (dot)                = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=x' * y
    end
    println("GFlop (array inner prod)   = ",2.0*n*reps/time*1E-9)
end

gflop_innerprod (generic function with 1 method)

In [33]:
gflop_innerprod(2000, 1000)
println("")
gflop_innerprod(2000, 1000)

GFlop (bad_innerprod)      = 0.056673317894482524
GFlop (better_innerprod0)  = 1.7688697497181969
GFlop (better_innerprod1)  = 1.625344319035586
GFlop (better_innerprod2)  = 1.7002378632770725
GFlop (better_innerprod3)  = 17.14765848723357
GFlop (better_innerprod4)  = 17.716361059438395
GFlop (dot)                = 15.644739260864295
GFlop (array inner prod)   = 15.762428675010248

GFlop (bad_innerprod)      = 0.06710325835975749
GFlop (better_innerprod0)  = 1.4106093338609014
GFlop (better_innerprod1)  = 1.7395244748919212
GFlop (better_innerprod2)  = 1.746145384064677
GFlop (better_innerprod3)  = 16.75182176061647
GFlop (better_innerprod4)  = 17.455434094826646
GFlop (dot)                = 14.994245958113575
GFlop (array inner prod)   = 13.67273622215462


## Exercise 1

* Create a function to compute the [smooth maximum](https://en.wikipedia.org/wiki/Smooth_maximum) of an array, `sum(x.*exp.(x))/sum(exp.(x))`.  Make one version that is relatively inefficient and one version that is as fast as you can make it.
* (if you have time) If you make the binary operation (`+`) in the definition of the dot product a parameter of your function, can you still get reasonable performance?   Try the [bitwise `xor`](https://docs.julialang.org/en/stable/manual/mathematical-operations/#Bitwise-Operators-1) on an array of ints.  Compare this to the [reduce function](https://docs.julialang.org/en/stable/stdlib/collections/#Base.reduce-Tuple{Any,Any,Any}).



# More on Arrays 
## Fusing Dot Operations

Writing explicit for-loops is one way to make code fast.  Last week we saw how to broadcast a function using dot operations.  We can [fuse multiple vectorized functions](https://docs.julialang.org/en/stable/manual/performance-tips/#More-dots:-Fuse-vectorized-operations-1) using the macro `@.`.  This is partly a convenience that lets you avoid writing a `.` after each function, but also insurance that every operation is dotted, so that the whole expression fuses into a single loop without temporary arrays.

In [41]:
f(x) = 3x.^2 + 4x + 7x.^3
fdot(x) = @. 3x^2 + 4x + 7x^3
;

In [43]:
n = 10^6
x = rand(n)
@time f(x);
@time fdot(x);
@time f.(x);
@time map(f,x)
;

  0.031703 seconds (18 allocations: 53.406 MiB, 13.23% gc time)
  0.008312 seconds (6 allocations: 7.630 MiB)
  0.009413 seconds (30 allocations: 7.631 MiB, 9.31% gc time)
  0.003850 seconds (7 allocations: 7.630 MiB)
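
Even the fused versions above allocate one new array for the result.  If you already have an output array, a fused broadcast can write into it in place.  The following is a sketch assuming Julia 1.x; `fdot!` is a hypothetical in-place variant of `fdot`, not something defined earlier.

```julia
# Hypothetical in-place variant: writes the polynomial into a preallocated `out`.
function fdot!(out, x)
    @. out = 3x^2 + 4x + 7x^3    # `@.` turns `=` into `.=`, so no new array is created
    return out
end

x = rand(10^6)
out = similar(x)     # allocate the output once
fdot!(out, x)        # later calls can reuse `out`
println(out[1] ≈ 3x[1]^2 + 4x[1] + 7x[1]^3)   # true
```

This is most useful inside a loop, where the same output buffer can be reused on every iteration.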


## Views

Views of arrays access a sub-array in memory (without making a copy).  If you want to perform an operation on a subarray, views can remove the cost of copying an array.  For more information see [here](https://docs.julialang.org/en/stable/manual/performance-tips/#Consider-using-views-for-slices-1)

In [44]:
fcopy(x) = sum(x[2:end-1])
@views fview(x) = sum(x[2:end-1])
;

In [46]:
n = 10^6
x = rand(n)

@time fcopy(x)
@time fview(x)
;

  0.003523 seconds (7 allocations: 7.630 MiB)
  0.001090 seconds (6 allocations: 224 bytes)


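Note the total allocation size: the copying version allocates a new 7.630 MiB array for the slice, while the view version allocates only 224 bytes.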
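
Because a view refers to the parent array's memory, writing through a view changes the parent, while writing to a sliced copy does not.  A small illustration, using made-up variable names:

```julia
parent = collect(1.0:5.0)     # [1.0, 2.0, 3.0, 4.0, 5.0]

c = parent[2:4]               # slicing makes a copy
v = view(parent, 2:4)         # a view shares memory with `parent`

c[1] = -1.0                   # does not touch `parent`
v[1] = 10.0                   # writes to parent[2]

println(parent)               # [1.0, 10.0, 3.0, 4.0, 5.0]
```

Keep this in mind when passing views to functions that modify their arguments.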

# Profiling Code

Julia has a [built-in profiler](https://docs.julialang.org/en/stable/manual/profile/#Profiling-1) which will allow you to see where your functions are spending most of their time.  If you have a script or function that is taking a long time to complete, this can help you identify where you should focus your optimization efforts.  On Julia 0.7 and later, load it first with `using Profile`.

In [47]:
function test_fn()
    A = randn(1000, 1000)
    b  = randn(1000)
    c = A * b
    maximum(c)
end

test_fn()

118.75479708557226

In [48]:
@profile test_fn()

117.06786076453068

In [49]:
Profile.print()

17 ./task.jl:335; (::IJulia.##14#17)()
 17 ...IJulia/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  17 .../Compat/src/Compat.jl:488; (::Compat.#inner#17{Array{Any,1},...
   17 ...rc/execute_request.jl:158; execute_request(::ZMQ.Socket, ::...
    17 .../Compat/src/Compat.jl:174; include_string(::Module, ::Strin...
     17 ./loading.jl:522; include_string(::String, ::String)
      17 ./<missing>:?; anonymous
       17 ./profile.jl:23; macro expansion
        16 ./In[47]:2; test_fn()
         1 ./random.jl:1378; randn!(::MersenneTwister, ::A...
         9 ./random.jl:1379; randn!(::MersenneTwister, ::A...
          2 ./random.jl:0; randn(::MersenneTwister, ::T...
          7 ./random.jl:1373; randn(::MersenneTwister, ::T...
           3 ./random.jl:1258; randn
            3 ./random.jl:340; rand_ui52
             3 ./random.jl:146; rand_ui52_raw
              3 ./random.jl:132; reserve_1
               1 ./random.jl:128; gen_rand
                1 ./dSFMT.jl:0; dsfmt_fill_array_close1_op..

The `@profile` macro runs the expression once while a sampling profiler periodically interrupts it and records the call stack.  The first number in each line is the number of samples in which that line was found on the call stack.  The rest of the line gives you information on the function and where to find it in the code base.  The output is indented based on where in the stack the function was found.

If you want to increase the number of samples, you can put your function in a for-loop as follows:

In [52]:
Profile.clear()

In [53]:
@profile for i = 1:100 test_fn() end

In [54]:
Profile.print()

1895 ./task.jl:335; (::IJulia.##14#17)()
 1895 ...Julia/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  1895 .../Compat/src/Compat.jl:488; (::Compat.#inner#17{Array{Any,1}...
   1895 ...c/execute_request.jl:158; execute_request(::ZMQ.Socket, :...
    1895 ...Compat/src/Compat.jl:174; include_string(::Module, ::Str...
     1895 ./loading.jl:522; include_string(::String, ::String)
      1895 ./<missing>:?; anonymous
       1895 ./profile.jl:23; macro expansion
        1895 ./In[53]:1; macro expansion
         1743 ./In[47]:2; test_fn()
          57   ./random.jl:1378; randn!(::MersenneTwister, ...
          1542 ./random.jl:1379; randn!(::MersenneTwister, ...
           292  ./random.jl:0; randn(::MersenneTwister, :...
           1088 ./random.jl:1373; randn(::MersenneTwister, :...
            421 ./random.jl:1258; randn
             421 ./random.jl:340; rand_ui52
              421 ./random.jl:146; rand_ui52_raw
               172 ./random.jl:145; rand_ui52_raw_inbounds
                172

If you'd like to go beyond the built-in profiler, there's a package called [ProfileView](https://github.com/timholy/ProfileView.jl) that will graphically interpret the results.  You can [track memory allocation](https://docs.julialang.org/en/stable/manual/profile/#Memory-allocation-analysis-1) for each line of code by starting up Julia with `--track-allocation=<setting>`.

## Exercise 2

* Use views and broadcasting to implement the [xor swap algorithm](https://en.wikipedia.org/wiki/XOR_swap_algorithm) on `X[inds]`, `Y[inds]`, where `X` and `Y` are arrays of the same type and size, and `inds` is a common subarray block. 
* Use Julia's profiler on the package of your choice.  What's taking the most time?  If you want a starting point, try PyCall.
* (if you have time) Try profiling your fast matrix types from HW 2.  Where would you focus your efforts if you wanted increased speed?

## More speed

### [Type Definitions](https://docs.julialang.org/en/stable/manual/performance-tips/#Type-declarations-1)

When you declare types, you should (whenever possible) give each field a concrete type rather than an abstract type, even a fairly specific one such as `Real`.  If you want to allow for multiple types in the field, parameterize your type.  If there is any ambiguity in what the actual instantiated type will be, the compiler will not be able to allocate space correctly, and will generally miss out on optimizations.

For more about type stability, check out the [`@code_warntype` macro](https://docs.julialang.org/en/stable/manual/performance-tips/#man-code-warntype-1)


In [55]:
function swapsub!(X::Array{T}, Y::Array{T}, inds) where {T}
    @views @. X[inds] = xor(X[inds], Y[inds])
    @views @. Y[inds] = xor(X[inds], Y[inds])
    @views @. X[inds] = xor(X[inds], Y[inds])
end

swapsub! (generic function with 1 method)

In [61]:
x = [1; 2; 3]
y = [4; 5; 6]
@time swapsub!(x, y, 1:2)
@show x
@show y
;

  0.000008 seconds (9 allocations: 384 bytes)
x = [4, 5, 3]
y = [1, 2, 6]


In [None]:
mutable struct AmbiguousType
    x
end

mutable struct StillAmbiguousType
    x::Real
end

mutable struct NonAmbiguousType
    x::Float64 
end

In [None]:
n = 2000
for T in (AmbiguousType, StillAmbiguousType, NonAmbiguousType)
    @time a = Array{T}(undef, n)
    t1 = @elapsed for i=1:n
        a[i] = T(randn())
    end
    println("$t1 seconds to fill array")
    s = T(0)
    t2 = @elapsed for i=1:n
        s.x += a[i].x
    end
    println("$t2 seconds to sum array")
end
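
The three types above do not show the parameterized option mentioned in the text.  A minimal sketch of it, assuming Julia 1.x, could look like this; `ParametricType` is a hypothetical name chosen to match the others.

```julia
# Hypothetical parametric version: the field type is fixed per instance type.
mutable struct ParametricType{T<:Real}
    x::T
end

p = ParametricType(1.5)    # T is inferred as Float64
q = ParametricType(3)      # T is inferred as Int

println(typeof(p))                                  # ParametricType{Float64}
println(isconcretetype(fieldtype(typeof(p), :x)))   # true
println(isconcretetype(Real))                       # false: Real is abstract
```

An array such as `Vector{ParametricType{Float64}}` then has a concrete element type, while the struct definition still accepts any `Real`.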

### Array Declaration

When you use arrays, you should pre-allocate if possible.  Specific type information is also valuable.

In [66]:
n = 10000
@time a1 = Real[] # Abstract type
@time a2 = Float64[] # specific type
@time a3 = Array{Real}(undef, n) # pre-allocated abstract type
@time a4 = Array{Float64}(undef, n) # pre-allocated specific type
;

  0.000009 seconds (5 allocations: 240 bytes)
  0.000005 seconds (5 allocations: 240 bytes)
  0.000052 seconds (13 allocations: 78.766 KiB)
  0.000036 seconds (13 allocations: 78.766 KiB)


In [68]:
t1 = @elapsed for i=1:n
    push!(a1,randn())
end
println("$t1 seconds to fill a1")
t2 = @elapsed for i=1:n
    push!(a2,randn())
end
println("$t2 seconds to fill a2")
t3 = @elapsed for i=1:n
    @inbounds a3[i] = randn()
end
println("$t3 seconds to fill a3")
t4 = @elapsed for i=1:n
    @inbounds a4[i] = randn()
end
println("$t4 seconds to fill a4")

0.001823978 seconds to fill a1
0.001704041 seconds to fill a2
0.001701424 seconds to fill a3
0.001548772 seconds to fill a4
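
If you want to keep using `push!` but know roughly how many elements to expect, `sizehint!` is a middle ground: it reserves capacity without changing the length.  A small sketch, with `fill_with_hint` as a hypothetical name:

```julia
# Hypothetical example: reserve capacity up front, then push! as usual.
function fill_with_hint(n)
    a = Float64[]
    sizehint!(a, n)        # reserve room for n elements; length(a) is still 0
    for i in 1:n
        push!(a, randn())
    end
    return a
end

a_hint = fill_with_hint(10_000)
println(length(a_hint))    # 10000
```

Whether this is measurably faster than plain `push!` depends on the array size and the Julia version, so time it on your own problem.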


### Subnormal Numbers

Subnormal (denormal) numbers are nonzero floating-point values smaller in magnitude than the smallest normal number of the type.  IEEE arithmetic requires that they be representable, but operations on them can incur performance penalties on some hardware.  You can tell Julia to treat subnormal numbers as zero with `set_zero_subnormals(true)`, which gives up strict IEEE behavior, so be careful.  See [Denormal Numbers](https://en.wikipedia.org/wiki/Denormal_number) on Wikipedia for more info.  The following example is from [Julia's documentation](https://docs.julialang.org/en/stable/manual/performance-tips/#treat-subnormal-numbers-as-zeros), and models the heat equation.

In [70]:
function timestep(b::Vector{T}, a::Vector{T}, Δt::T) where {T}
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i=2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow(a::Vector{T}, nstep::Integer) where {T}
    b = similar(a)
    for t=1:div(nstep,2)                # Assume nstep is even
        timestep(b,a,T(0.1))
        timestep(a,b,T(0.1))
    end
end

heatflow(zeros(Float32,10),2)           # Force compilation
for trial=1:6
    a = zeros(Float32,1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)
end

  0.003427 seconds (1 allocation: 4.063 KiB)
  0.003120 seconds (1 allocation: 4.063 KiB)
  0.004569 seconds (1 allocation: 4.063 KiB)
  0.002824 seconds (1 allocation: 4.063 KiB)
  0.004497 seconds (1 allocation: 4.063 KiB)
  0.001862 seconds (1 allocation: 4.063 KiB)
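
To see what a subnormal number looks like, you can construct one directly.  This sketch assumes Julia 1.x, where the smallest positive normal number is given by `floatmin` (older versions called it `realmin`).

```julia
set_zero_subnormals(false)    # restore strict IEEE behavior (the loop above leaves it enabled)

tiny = floatmin(Float32)      # smallest positive normal Float32
sub  = tiny / 2               # halving it gives a subnormal number

println(issubnormal(tiny))    # false
println(issubnormal(sub))     # true
println(sub > 0)              # true: it is small, but not zero
```

In the heat-flow example, values near the cold end of the rod decay toward zero and pass through this range, which is one plausible reason the strict IEEE trials are slower.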


### [Access arrays in column-major order](https://docs.julialang.org/en/stable/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1)
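If you need to loop over an array, keep in mind that it is stored in column-major format, so the first index varies fastest in memory.  Put the first index in the innermost loop and the last index in the outermost loop, so that consecutive iterations touch adjacent memory and you use blocks of memory more efficiently.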

In [71]:
# access in Column-major order
function sum_array1(A::Array{T,3}) where {T}
    s::T = 0
    @simd for k=1:size(A,3)
        @simd for j=1:size(A,2)
            @simd for i=1:size(A,1)
                @inbounds s += A[i,j,k]
            end
        end
    end
    return s
end

# access in Row-major order
function sum_array2(A::Array{T,3}) where {T}
    s::T = 0
    @simd for i=1:size(A,1)
        @simd for j=1:size(A,2)
            @simd for k=1:size(A,3)
                @inbounds s += A[i,j,k]
            end
        end
    end
    return s
end
;

In [73]:
n = 300
A = rand(Int64,n,n,n)
@time sum_array1(A)
@time sum_array2(A)
;

  0.020375 seconds (5 allocations: 176 bytes)
  0.306226 seconds (5 allocations: 176 bytes)


In [74]:
A = [1 2; 3 4]
@show A
@show A[:]
;

A = [1 2; 3 4]
A[:] = [1, 3, 2, 4]
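
The output of `A[:]` shows the memory order: columns are stored one after another.  For a matrix with `m` rows, element `A[i,j]` sits at linear position `i + (j-1)*m`.  A small check of that formula, using a hypothetical helper `colmajor_index` and a made-up matrix `B`:

```julia
B = [1 2 3; 4 5 6]                  # 2×3 matrix
m = size(B, 1)

# Hypothetical helper: linear position of element (i, j) in a matrix with m rows.
colmajor_index(i, j, m) = i + (j - 1) * m

println(vec(B))                                   # [1, 4, 2, 5, 3, 6]
println(colmajor_index(2, 3, m))                  # 6
println(B[colmajor_index(1, 2, m)] == B[1, 2])    # true
```

Stepping `i` by one moves to the neighboring memory location, while stepping `j` by one jumps `m` locations, which is why the innermost loop should run over `i`.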


### [Minor Tweaks](https://docs.julialang.org/en/stable/manual/performance-tips/#Tweaks-1)

Julia's performance documentation suggests the following optimizations for making very fast inner loops:

* Avoid unnecessary arrays. For example, instead of `sum([x,y,z])` use `x+y+z`.
* Use `abs2(z)` instead of `abs(z)^2` for complex z. In general, try to rewrite code to use `abs2()` instead of `abs()` for complex arguments. (This would be useful for writing fast Julia Set codes!)
* Use `div(x,y)` for truncating division of integers instead of `trunc(x/y)`, `fld(x,y)` instead of `floor(x/y)`, and `cld(x,y)` instead of `ceil(x/y)`.
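
A quick check that each suggested replacement gives the same answer as the expression it replaces (the values here are arbitrary examples):

```julia
p, q, r = 1.0, 2.0, 3.0
println(sum([p, q, r]) == p + q + r)           # true, and the right side allocates no array

w = 3.0 + 4.0im
println(abs2(w))                               # 25.0
println(abs(w)^2)                              # 25.0, but computes a square root first

# Integer division without a round trip through floating point:
println(div(7, 2), " ", trunc(Int, 7 / 2))     # 3 3
println(fld(-7, 2), " ", floor(Int, -7 / 2))   # -4 -4
println(cld(7, 2), " ", ceil(Int, 7 / 2))      # 4 4
```

The integer versions also stay exact for integers too large to be represented exactly as a `Float64`.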

In [77]:
n = 100000
@time z = Array{Complex{Float64}}(undef, n)
t = @elapsed for i = 1:n
    @inbounds z[i] = Complex{Float64}(randn(),randn())
end
println("$t sec to initialize")

  0.005037 seconds (13 allocations: 1.527 MiB, 97.21% gc time)
0.016874269 sec to initialize


In [79]:
bnd = 0.5
s = zero(Int64)
t = @elapsed for i = 1:n
    @inbounds s += (abs(z[i]) > bnd ? one(Int64) : zero(Int64))
end
println("$t sec using abs()")
bnd2 = bnd*bnd
s = zero(Int64)
t = @elapsed for i = 1:n
    @inbounds s += (abs2(z[i]) > bnd2 ? one(Int64) : zero(Int64))
end
println("$t sec using abs2()")

0.024997542 sec using abs()
0.021061187 sec using abs2()


## Exercise 3

* Write a standard matrix-vector multiplication function on two arrays.  Can you get close to the default implementation's performance (BLAS gemv)?
* How would you modify your routine to do a matrix transpose-vector multiplication routine? 
* Why do you think [BLAS's gemv](http://www.netlib.org/lapack/explore-html/dc/da8/dgemv_8f.html) takes the arguments that it does?

## Additional Performance Analysis

* [Lint](https://github.com/tonyhffong/Lint.jl) - analyze code for potential improvements