# Class 7 - Performance

Today we'll talk a bit about performance, in particular how to write Julia code with performance in mind, and how to measure performance.  You can find a lot of useful material in Julia's documentation: [FAQ](http://docs.julialang.org/en/release-0.4/manual/faq/), [performance tips](http://docs.julialang.org/en/release-0.4/manual/performance-tips/), [profiling](http://docs.julialang.org/en/release-0.4/manual/profile/).

Today we'll pay attention to two things: speed and memory allocation.  A few general heuristics are:
* Preallocation is good (don't grow arrays dynamically if avoidable)
* Type annotations are good (tell the compiler which types you want to instantiate)
* Avoid changing type of variables
* Write multiple function methods instead of multiple code paths in a function
* Prefer for-loops to vectorized notation (see [Victor's notebook](../../packages/devectorize.ipynb))
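As a small illustration of the first heuristic, here is a minimal sketch that builds the same vector twice, once by growing it with `push!` and once by filling a preallocated array. The function names `squares_grow` and `squares_prealloc` are hypothetical and not part of this notebook; `@allocated` reports the bytes allocated by an expression, and the exact numbers depend on your Julia version and machine.

```julia
# Hypothetical helper names, used only for this illustration.

# Grows the output one element at a time; the vector is resized as it fills up.
function squares_grow(n)
    out = Float64[]
    for i in 1:n
        push!(out, i^2)
    end
    return out
end

# Allocates the output once at its final size, then fills it in place.
function squares_prealloc(n)
    out = zeros(n)
    for i in 1:n
        out[i] = i^2
    end
    return out
end

# Call each function once first so that compilation is not measured.
squares_grow(10)
squares_prealloc(10)

println("bytes allocated when growing:       ", @allocated squares_grow(10^5))
println("bytes allocated when preallocating: ", @allocated squares_prealloc(10^5))
```

Both functions return the same result; the preallocated version typically allocates less because it never has to resize its output.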

You've probably already used the `@time` macro, which is an easy way to get an idea of speed and memory allocation:

In [7]:
function bad_innerprod(x, y)
    @assert length(x) == length(y)
    ans = 0
    for i = 1:length(x)
        ans += x[i] * y[i]
    end
    return ans
end
;

In [8]:
n = 1000
x = randn(n)
y = randn(n)
;

In [11]:
@time bad_innerprod(x, y)

  

1.6477951308449887

0.000062 seconds (2.00 k allocations: 31.406 KB)


As you can see, the `@time` macro reports both elapsed time and memory allocation.  The function above allocates about 31 KB of memory just to do some simple arithmetic.  This may not seem like a big deal, but imagine this function as a very small component of a much larger program, called thousands of times.  Here's a better version:

In [12]:
function better_innerprod0{T}(x::Array{T}, y::Array{T})
    @assert length(x) == length(y)
    ans = zero(T)
    for i = 1:length(x)
        ans += x[i] * y[i]
    end
    return ans
end
;

In [15]:
@time better_innerprod0(x,y)

  

1.6477951308449887

0.000005 seconds (5 allocations: 176 bytes)


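Essentially just by providing type information, we kept the compiler from allocating unnecessary memory.  The key change is `ans = zero(T)`: the accumulator now has the element type of the arrays for the whole loop, instead of starting as the integer `0` and turning into a floating-point number on the first iteration.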
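One way to see why the initial value of the accumulator matters is to look at the type that comes back. The sketch below uses two hypothetical functions, `unstable_sum` and `stable_sum`, which are not part of the notebook. With an integer start value, the type of the result depends on whether the loop runs at all, so the compiler cannot settle on a single type for the accumulator. How much this costs in practice depends on the Julia version.

```julia
# Hypothetical helper names, used only for this illustration.

# The accumulator starts as an Int and changes type once a Float64 is added.
function unstable_sum(x)
    s = 0
    for v in x
        s += v
    end
    return s
end

# The accumulator has the element type of x from the start.
function stable_sum(x)
    s = zero(eltype(x))
    for v in x
        s += v
    end
    return s
end

println(typeof(unstable_sum(Float64[])))    # an integer type: the loop never ran
println(typeof(unstable_sum([1.5, 2.5])))   # Float64
println(typeof(stable_sum(Float64[])))      # Float64
println(typeof(stable_sum([1.5, 2.5])))     # Float64
```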

In [17]:
function better_innerprod1{T}(x::Array{T}, y::Array{T})
    @assert length(x) == length(y)
    ans = zero(T)
    for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    return ans
end
;

In [20]:
@time better_innerprod1(x, y)

  

1.6477951308449887

0.000004 seconds (5 allocations: 176 bytes)


The `@inbounds` macro tells the compiler that it can skip checking whether each index refers to a location inside the array.  The compiler may be able to infer this on its own in this particular example, but in more complicated loops the macro may give you a noticeable speedup.  Only use it when you know every index is valid.
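To make the check concrete, here is a minimal sketch of what happens without `@inbounds` when an index is out of range: Julia throws a `BoundsError` instead of reading past the end of the array. The variable names are made up for this illustration.

```julia
v = [1.0, 2.0, 3.0]

# Indexing past the end is caught by the bounds check.
caught = try
    v[4]
    false
catch err
    isa(err, BoundsError)
end

println(caught)   # true
```

With `@inbounds` this safety net is removed, so an invalid index can silently read or corrupt unrelated memory.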

Now we separate the inner loop from the argument check.  If you have more complex logic, breaking your functions into smaller components can speed up evaluation.

In [24]:
function fast_innerprod{T}(x::Array{T}, y::Array{T})
    ans::T = 0
    for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod2{T}(x::Array{T}, y::Array{T})
    @assert length(x) == length(y)
    fast_innerprod(x, y)
end
;

In [28]:
@time fast_innerprod(x, y)
@time better_innerprod2(x, y)

  

1.6477951308449887

0.000006 seconds (5 allocations: 176 bytes)
  0.000004 seconds (5 allocations: 176 bytes)


The `@simd` macro can be used in loops that can be vectorized.  This means no `break`s or `continue`s, and that iterations should not depend on one another, except through a reduction variable such as `ans`, whose updates `@simd` is allowed to reorder.

In [35]:
function fast_innerprod2{T}(x::Array{T}, y::Array{T})
    ans::T = 0
    @simd for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod3{T}(x::Array{T}, y::Array{T})
    @assert length(x) == length(y)
    fast_innerprod2(x, y)
end
;

In [37]:
@time fast_innerprod2(x, y)
@time better_innerprod3(x, y)

  

1.647795130845014

0.000005 seconds (5 allocations: 176 bytes)
  0.000002 seconds (5 allocations: 176 bytes)
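Note that the result above differs from the earlier versions in its last few digits. A likely explanation is that `@simd` lets the compiler add the terms in a different order, and floating-point addition is not associative. A minimal sketch of that effect, with illustrative values that are not taken from the notebook:

```julia
a, b, c = 0.1, 0.2, 0.3

left  = (a + b) + c    # add left to right
right = a + (b + c)    # same terms, different grouping

println(left == right)   # false
println(left, " vs ", right)
```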


You can also use the equivalent of the `-ffast-math` compiler optimization flag with `@fastmath`.  Note that this may change the accuracy of your results, or give you an answer that is entirely wrong if you aren't careful.

In [55]:
function fast_innerprod3{T}(x::Array{T}, y::Array{T})
    ans::T = 0
    @fastmath @simd for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod4{T}(x::Array{T}, y::Array{T})
    @assert length(x) == length(y)
    fast_innerprod3(x, y)
end
;

In [57]:
@time fast_innerprod3(x, y)
@time better_innerprod4(x, y)

  

1.647795130845014

0.000007 seconds (5 allocations: 176 bytes)
  0.000002 seconds (5 allocations: 176 bytes)


Here's how you would compute an inner product with Julia's built-in `dot` (note that this calls BLAS):

In [39]:
@time dot(x,y)

  

1.647795130845033

0.000005 seconds (5 allocations: 176 bytes)


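Now, let's compare the efficiency of each implementation.  An inner product of length `n` costs `2n` floating-point operations (`n` multiplications and `n` additions), so the numbers printed below are rates in GFlop/s.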

In [48]:
function gflop_innerprod( n, reps )
    x = randn(n)
    y = randn(n)
    s = zero(Float64)
    time = @elapsed for j in 1:reps
        s+=bad_innerprod(x,y)
    end
    println("GFlop (bad_innerprod)      = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod0(x,y)
    end
    println("GFlop (better_innerprod0)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod1(x,y)
    end
    println("GFlop (better_innerprod1)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod2(x,y)
    end
    println("GFlop (better_innerprod2)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod3(x,y)
    end
    println("GFlop (better_innerprod3)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=better_innerprod4(x,y)
    end
    println("GFlop (better_innerprod4)  = ",2.0*n*reps/time*1E-9)
    time = @elapsed for j in 1:reps
        s+=dot(x,y)
    end
    println("GFlop (dot)                = ",2.0*n*reps/time*1E-9)
end

gflop_innerprod (generic function with 1 method)

In [60]:
gflop_innerprod(2000, 1000)
println("")
gflop_innerprod(2000, 1000)

GFlop (bad_innerprod)      = 0.05329956628144184
GFlop (better_innerprod0)  = 0.8921788479562192
GFlop (better_innerprod1)  = 1.3833305897276638
GFlop (better_innerprod2)  = 1.2420130796397417
GFlop (better_innerprod3)  = 2.6077302250927534
GFlop (better_innerprod4)  = 3.2919968265150596
GFlop (dot)                = 5.461556106565883

GFlop (bad_innerprod)      = 0.06489902279918897
GFlop (better_innerprod0)  = 1.4057777465382721
GFlop (better_innerprod1)  = 2.300171190240834
GFlop (better_innerprod2)  = 2.23589596823239
GFlop (better_innerprod3)  = 4.879066244302165
GFlop (better_innerprod4)  = 4.922561947365507
GFlop (dot)                = 6.402684005134953
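The timing blocks above all apply the same formula, so the benchmark can also be written once and reused. The following is a sketch under stated assumptions, not the notebook's method: `gflops`, `time_innerprod`, and `simple_innerprod` are hypothetical names, and the warm-up call is there so that compilation is not included in the measurement (which is presumably why the first run above reports lower rates than the second).

```julia
# Convert a timing into GFlop/s: one inner product of length n costs
# n multiplications and n additions, i.e. 2n floating-point operations.
gflops(n, reps, seconds) = 2.0 * n * reps / seconds * 1e-9

# Hypothetical generic timer; f is any function that takes two vectors.
function time_innerprod(f, n, reps)
    x = randn(n)
    y = randn(n)
    f(x, y)                      # warm-up call, not timed
    s = 0.0
    seconds = @elapsed for j in 1:reps
        s += f(x, y)
    end
    # Return s as well, so the accumulated result is actually used.
    return gflops(n, reps, seconds), s
end

# Stand-in implementation so that this sketch is self-contained.
function simple_innerprod(x, y)
    s = zero(eltype(x))
    for i in 1:length(x)
        s += x[i] * y[i]
    end
    return s
end

# 2 * 2000 * 1000 operations in 0.002 seconds is about 2 GFlop/s.
println(gflops(2000, 1000, 0.002))

rate, total = time_innerprod(simple_innerprod, 2000, 1000)
println("GFlop/s (simple_innerprod) = ", rate)
```

Any of the implementations defined earlier could be passed in place of `simple_innerprod`.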


# Exercise 1

* Switch the $+$ and $*$ operations in the definition of the inner product, and write a function that computes this modified operation.  Make one version that is relatively inefficient and one version that is as fast as you can make it.
* Write a standard matrix-vector multiplication function on two arrays.  Can you get close to Julia's default implementation's performance?
* (If you have time) If you make the binary operations in the definition of matrix-vector multiplication parameters of your function, can you still get reasonable performance?

In [85]:
function bad_bool(x::Array{Bool},y::Array{Bool}, z::Array{Bool})
    ans::Bool = true
    for i = 1:length(x)
        ans = ans && (x[i] || y[i] || z[i]) 
    end
    return ans
end

bad_bool (generic function with 3 methods)

In [88]:
@time bad_bool(x,y,z)

  

false

0.000005 seconds (4 allocations: 160 bytes)


In [89]:
function good_bool(x::Array{Bool}, y::Array{Bool}, z::Array{Bool})
    ans::Bool = true
    @simd for i = 1:length(x)
        @inbounds ans = ans && (x[i] || y[i] || z[i]) 
    end
    return ans
end

good_bool (generic function with 2 methods)

In [93]:
@time good_bool(x,y,z)

  

false

0.000004 seconds (4 allocations: 160 bytes)


In [73]:
function timestep{T}( b::Vector{T}, a::Vector{T}, Δt::T )
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i=2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow{T}( a::Vector{T}, nstep::Integer )
    b = similar(a)
    for t=1:div(nstep,2)                # Assume nstep is even
        timestep(b,a,T(0.1))
        timestep(a,b,T(0.1))
    end
end

heatflow(zeros(Float32,10),2)           # Force compilation
for trial=1:6
    a = zeros(Float32,1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)
end

  0.006194 seconds (1 allocation: 3.984 KB)
  0.003146 seconds (1 allocation: 3.984 KB)
  0.006999 seconds (1 allocation: 3.984 KB)
  0.003272 seconds (1 allocation: 3.984 KB)
  0.006223 seconds (1 allocation: 3.984 KB)
  0.003088 seconds (1 allocation: 3.984 KB)
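In the timings above, the odd trials (strict IEEE arithmetic) take roughly twice as long as the even ones. In this simulation the values far from the boundary are extremely small, and numbers below the smallest normal floating-point value are stored as subnormals, which many CPUs process slowly; `set_zero_subnormals(true)` treats them as zero instead. As a minimal sketch of what counts as a subnormal for `Float32` (the values here are chosen for illustration):

```julia
tiny   = 1.0f-40    # below the smallest normal Float32, which is about 1.18f-38
normal = 1.0f0

println(issubnormal(tiny))     # true
println(issubnormal(normal))   # false
```

Flushing subnormals to zero changes results slightly, so it trades strict IEEE behavior for speed.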
