# Performance

Today we'll talk a bit about performance, in particular how to write Julia code with performance in mind and how to measure it.  You can find a lot of useful material in Julia's documentation: [performance tips](https://docs.julialang.org/en/v1/manual/performance-tips/index.html), [profiling](https://docs.julialang.org/en/v1/manual/profile/).

We won't talk too much about performance relative to other languages (see Julia's figure [here](http://julialang.org/benchmarks/), and look around the internet for criticisms), and mostly concern ourselves with writing fast code within Julia.  However, many of these topics apply directly or indirectly to many languages used for scientific computing, so keep an eye out for questions or connections to languages you are familiar with.

A few things that we'll pay attention to today are speed and memory allocation.  A few general heuristics are:
* Preallocation is good (don't grow arrays dynamically if avoidable)
* Type annotations are good (tell the compiler which types you want to instantiate)
* Avoid changing the type of variables
* Write multiple function methods instead of multiple code paths in a function
* It is (usually) faster to write for loops than to use vectorization if you're doing anything complicated
* When modifying a global variable with a complicated operation, update a local variable instead and update the global variable at the end

It is worth mentioning that if you don't want to worry about this sort of thing, that's OK.  One of the nice things about Julia is that you can use it at a high level without getting bogged down in this sort of analysis.  However, if you use certain functions a lot, plan on having others use your functions a lot, or want your simulations to finish faster, it may be worth taking a second pass at your code to find optimizations.  With a bit of practice, you will also be able to write faster code the first time around.

## Notes on Timing

* Remember that the first time you run a function it is "just-in-time" (JIT) compiled, so the first call includes compilation time.  Timing a second call gives you a better idea of how fast the function actually is.

* The actual amount of time it takes to run a function depends on how fast you are able to schedule a function call on your machine.  Remember, you're not just running Julia on your computer, but also an operating system, and perhaps a variety of other applications, all sharing your processor's time and attention.  Usually the best way to make this effect negligible is to amortize it over many function runs by calling a function many times in succession.

* Since the overhead induced by your operating system is essentially random, sometimes a few tests with `@time` will indicate that a change to the code makes it faster, even if it is actually a performance regression. To get around this issue we will be using `BenchmarkTools.jl` to test our code for this lecture. You can install it with `using Pkg ; Pkg.add("BenchmarkTools") ; using BenchmarkTools`.

* Your processor will have a big impact on performance.  Clock speed is the most obvious relevant variable, but architecture and compiler support for your architecture also matter (in this case LLVM support).
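The first two notes can be sketched with nothing but Base's `@elapsed`. The helper `mean_runtime` and the test function `sumsq` below are hypothetical names made up for this illustration, not part of any package; `BenchmarkTools.jl` does the same kind of amortization far more carefully.

```julia
# Hypothetical helper: average the runtime of f(args...) over many calls
function mean_runtime(f, args...; reps = 1_000)
    f(args...)                       # warm-up call, so compilation is not timed
    t = @elapsed for _ in 1:reps
        f(args...)
    end
    return t / reps                  # amortized time per call, in seconds
end

sumsq(v) = sum(abs2, v)              # small function to time

v = randn(10_000)
t_first = @elapsed sumsq(v)          # first call: includes JIT compilation
t_avg = mean_runtime(sumsq, v)       # later calls, averaged over many runs
println("first call: $t_first s, amortized: $t_avg s")
```

On most machines the first call should come out noticeably slower than the amortized figure, but the exact numbers will vary from run to run.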

## Example: The Inner Product

Here is a fairly straightforward way to implement the inner product of two vectors:

In [3]:
function slow_innerprod(x, y)
    @assert length(x) == length(y)
    ans = 0
    for i = 1:length(x)
        ans += x[i] * y[i]
    end
    return ans
end
;

In [4]:
using BenchmarkTools
n = 10000
x = randn(n)
y = randn(n)
;

In [15]:
@time slow_innerprod(x,y);

  0.000031 seconds (5 allocations: 176 bytes)


In [16]:
@benchmark slow_innerprod(x, y)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     13.203 μs (0.00% GC)
  median time:      13.759 μs (0.00% GC)
  mean time:        15.964 μs (0.00% GC)
  maximum time:     98.742 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

As you can see, both the `@time` and `@benchmark` macros report timing and memory allocation information. Although we did not provide Julia with any type information in this example, the compiler was able to infer that we were operating on floating-point numbers and avoided allocating a large amount of memory. However, for more complicated functions Julia will not always be able to do this, and performance will suffer as a consequence.

In [17]:
function better_innerprod(x::Array{T}, y::Array{T})::T where T
    @assert length(x) == length(y)
    ans = zero(T)
    for i = 1:length(x)
        ans += x[i] * y[i]
    end
    return ans
end
;

In [18]:
@benchmark better_innerprod(x,y)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.649 μs (0.00% GC)
  median time:      11.703 μs (0.00% GC)
  mean time:        13.946 μs (0.00% GC)
  maximum time:     52.999 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Essentially just by providing type information (and initializing the accumulator with `zero(T)`, so that `ans` no longer starts out as an integer and then turns into a float), we obtain a slight performance boost. Note that the memory estimate is unchanged at 16 bytes, so the gain here is in time rather than in allocations. This is not that impressive in the case of this simple dot product function, but for more complicated programs Julia's compiler will not be able to infer types as effectively, and the difference will be more significant.
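One difference between the two versions is how the accumulator is initialized: `ans = 0` starts as an integer and becomes a float on the first addition, while `zero(T)` has the element type from the start. Here is a minimal sketch of that effect; the helper names are hypothetical and exist only for this illustration.

```julia
# Hypothetical helper: report the accumulator's type before and after the loop
function accumulator_types(x)
    acc = 0                      # starts as an Int, as in slow_innerprod
    before = typeof(acc)
    for xi in x
        acc += xi                # Int + Float64 promotes to Float64
    end
    return before, typeof(acc)
end

# Same loop, but the accumulator has the element type from the start
function accumulator_types_stable(x::Array{T}) where T
    acc = zero(T)
    before = typeof(acc)
    for xi in x
        acc += xi
    end
    return before, typeof(acc)
end

v = randn(5)
println(accumulator_types(v))         # (Int64, Float64) on a 64-bit machine
println(accumulator_types_stable(v))  # (Float64, Float64)
```

A variable whose type changes partway through a function is what the "avoid changing the type of variables" heuristic at the top of these notes refers to.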

Here is an example of a more complicated function where type information helps significantly:

In [19]:
function merge_slow(x,y)
    output = Array{Any}(undef,length(x)+length(y))
    x_ind = 1; y_ind = 1
    output_ind = 1
    while x_ind <= length(x) || y_ind <= length(y)
        if y_ind > length(y)
            output[output_ind] = x[x_ind]
            x_ind += 1
        elseif x_ind > length(x) || x[x_ind] > y[y_ind]
            output[output_ind] = y[y_ind]
            y_ind += 1
        else
            output[output_ind] = x[x_ind]
            x_ind += 1
        end
        output_ind += 1
    end
    return output
end
     
function mergesort_slow(x)
    if length(x) <= 1  # base case (also stops the recursion on empty input)
        return x
    end
    split = div(length(x),2)
    y = mergesort_slow(x[1:split])
    z = mergesort_slow(x[split+1:end])
    return merge_slow(y,z)
end

mergesort_slow (generic function with 1 method)

In [20]:
v = rand(10000)
@benchmark mergesort_slow(v)

BenchmarkTools.Trial: 
  memory estimate:  4.69 MiB
  allocs estimate:  42969
  --------------
  minimum time:     5.846 ms (0.00% GC)
  median time:      7.681 ms (0.00% GC)
  mean time:        8.129 ms (8.52% GC)
  maximum time:     88.815 ms (90.31% GC)
  --------------
  samples:          614
  evals/sample:     1

In [25]:
function merge_fast1(x::Array{Float64,1},y::Array{Float64,1})::Array{Float64,1}
    output = Array{Float64}(undef,length(x)+length(y))
    x_ind = 1; y_ind = 1
    output_ind = 1
    while x_ind <= length(x) || y_ind <= length(y)
        if y_ind > length(y)
            output[output_ind] = x[x_ind]
            x_ind += 1
        elseif x_ind > length(x) || x[x_ind] > y[y_ind]
            output[output_ind] = y[y_ind]
            y_ind += 1
        else
            output[output_ind] = x[x_ind]
            x_ind += 1
        end
        output_ind += 1
    end
    return output
end
     
function mergesort_fast1(x::Array{Float64,1})::Array{Float64,1}
    if length(x) <= 1  # base case (also stops the recursion on empty input)
        return x
    end
    split = div(length(x),2)
    y = mergesort_fast1(x[1:split])
    z = mergesort_fast1(x[split+1:end])
    return merge_fast1(y,z)
end

mergesort_fast1 (generic function with 1 method)

In [27]:
v = rand(10000)
@benchmark mergesort_fast1(v)

BenchmarkTools.Trial: 
  memory estimate:  4.49 MiB
  allocs estimate:  30010
  --------------
  minimum time:     2.745 ms (0.00% GC)
  median time:      3.576 ms (0.00% GC)
  mean time:        4.188 ms (13.65% GC)
  maximum time:     108.041 ms (96.85% GC)
  --------------
  samples:          1190
  evals/sample:     1

Storing the output in a `Float64` array rather than an `Any` vector gives a significant speedup: in the runs above the minimum time drops from 5.846 ms to 2.745 ms.
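The fast version above is hard-coded to `Float64`. As a sketch of how the same idea might be written generically, the element type `T` can be taken from the inputs with a `where T` clause. The names `merge_generic` and `mergesort_generic` are hypothetical, and this variant has not been benchmarked here.

```julia
# Hypothetical generic variant: the element type T is taken from the inputs
function merge_generic(x::Vector{T}, y::Vector{T}) where T
    output = Vector{T}(undef, length(x) + length(y))   # concrete element type
    i = 1; j = 1
    for k in eachindex(output)
        # take from x when y is exhausted, or when x's head is not larger
        if j > length(y) || (i <= length(x) && x[i] <= y[j])
            output[k] = x[i]; i += 1
        else
            output[k] = y[j]; j += 1
        end
    end
    return output
end

function mergesort_generic(x::Vector{T}) where T
    length(x) <= 1 && return x
    mid = div(length(x), 2)
    return merge_generic(mergesort_generic(x[1:mid]),
                         mergesort_generic(x[mid+1:end]))
end

v = rand(1:100, 57)
println(mergesort_generic(v) == sort(v))
println(eltype(mergesort_generic([3.0, 1.0, 2.0])))
```

Because Julia compiles a specialized method for each concrete `T`, a version like this should behave much like the `Float64`-only one for `Float64` input, while also working for integers or other ordered types.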

Back to inner products...

In [28]:
function betterer_innerprod(x::Array{T}, y::Array{T})::T where T
    @assert length(x) == length(y)
    ans = zero(T)
    for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    return ans
end
;

In [29]:
@benchmark betterer_innerprod(x, y)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     11.625 μs (0.00% GC)
  median time:      11.649 μs (0.00% GC)
  mean time:        14.641 μs (0.00% GC)
  maximum time:     77.573 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The `@inbounds` macro tells the compiler to skip bounds checking, that is, checking whether an index refers to a memory location outside the array.  The compiler may be able to infer that the accesses are safe in this particular example, but if you have more complicated loops, the macro may give you a noticeable speedup.  Only use it when you are sure every index is valid.

Now, we separate the inner loop from the length check.  If you have more complex logic, breaking your functions into smaller components can speed up evaluation.

In [30]:
function fast_innerprod(x::Array{T}, y::Array{T})::T where T
    ans::T = 0
    for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    return ans
end

function better_innerprod1(x::Array{T}, y::Array{T})::T where T
    @assert length(x) == length(y)
    return fast_innerprod(x, y)
end
;

In [None]:
@benchmark better_innerprod1(x, y)

In [31]:
function faster_innerprod(x::Array{T}, y::Array{T})::T where T
    ans::T = 0
    @simd for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    return ans
end

function better_innerprod2(x::Array{T}, y::Array{T})::T where T
    @assert length(x) == length(y)
    return faster_innerprod(x, y)
end
;

In [32]:
@benchmark better_innerprod2(x, y)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     2.447 μs (0.00% GC)
  median time:      2.571 μs (0.00% GC)
  mean time:        2.755 μs (0.00% GC)
  maximum time:     19.259 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

The `@simd` macro ("Single Instruction, Multiple Data") can be used in loops that can be vectorized.  This means no `break`s or `continue`s, and that loop iterations should not depend on one another (apart from reduction variables such as `ans`, which `@simd` is allowed to reorder).  See more [here](https://en.wikipedia.org/wiki/SIMD). As you can see, the improvement from vectorizing our code dwarfs anything else we could do for this simple example. Vectorization is powerful!
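To make the "no dependence on previous iterations" condition concrete, here is a small sketch contrasting a loop that carries a dependency with one that does not. Both function names are hypothetical and chosen only for this illustration.

```julia
# Loop-carried dependency: iteration i reads the result of iteration i-1,
# so this loop is NOT a candidate for @simd.
function running_sum!(out, x)
    out[1] = x[1]
    for i in 2:length(x)
        out[i] = out[i-1] + x[i]
    end
    return out
end

# Independent iterations: each out[i] depends only on x[i] and y[i],
# so @simd is appropriate here.
function scaled_add!(out, a, x, y)
    @simd for i in eachindex(out, x, y)   # errors if the index ranges differ
        @inbounds out[i] = a * x[i] + y[i]
    end
    return out
end

println(running_sum!(zeros(4), [1.0, 2.0, 3.0, 4.0]))              # [1.0, 3.0, 6.0, 10.0]
println(scaled_add!(zeros(3), 2.0, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # [3.0, 5.0, 7.0]
```

Calling `eachindex` with all three arrays is one way to keep `@inbounds` safe, since it throws if the arrays do not share the same indices.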

You can also use the equivalent of the `-ffast-math` compiler optimization flag with `@fastmath`.  Note that this may change the accuracy of your results, or give you an answer that is entirely wrong if you aren't careful.

In [36]:
function fastest_innerprod(x::Array{T}, y::Array{T})::T where T
    ans::T = 0
    @fastmath @simd for i = 1:length(x)
        @inbounds ans += x[i] * y[i]
    end
    ans
end

function better_innerprod3(x::Array{T}, y::Array{T})::T where T
    @assert length(x) == length(y)
    fastest_innerprod(x, y)
end
;

In [37]:
@benchmark better_innerprod3(x, y)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     2.423 μs (0.00% GC)
  median time:      2.611 μs (0.00% GC)
  mean time:        3.058 μs (0.00% GC)
  maximum time:     15.379 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

Here's how you would compute an inner product with Julia's built-in `dot` (note that this calls BLAS).

In [33]:
using LinearAlgebra

In [34]:
@benchmark x' * y 

BenchmarkTools.Trial: 
  memory estimate:  32 bytes
  allocs estimate:  2
  --------------
  minimum time:     2.586 μs (0.00% GC)
  median time:      2.738 μs (0.00% GC)
  mean time:        3.187 μs (0.00% GC)
  maximum time:     62.792 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

In [35]:
@benchmark dot(x,y)

BenchmarkTools.Trial: 
  memory estimate:  16 bytes
  allocs estimate:  1
  --------------
  minimum time:     2.493 μs (0.00% GC)
  median time:      2.627 μs (0.00% GC)
  mean time:        3.016 μs (0.00% GC)
  maximum time:     14.121 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

As you can see, our implementation performs roughly as well as Julia's native functions for inner products.

## Exercise 1

* Modify the slowest and fastest versions of our dot product function to write a function which sums the entries of a vector. Compare their performances with `@benchmark`.
* Create a function to compute the [smooth maximum](https://en.wikipedia.org/wiki/Smooth_maximum) of an array - `sum(x.*exp.(x))/sum(exp.(x))`  Make one version that is relatively inefficient and one version that is as fast as you can make it.
* (if you have time) If you make the binary operation (`+`) in the definition of the dot product a parameter of your function, can you still get reasonable performance?   Try the [bitwise `xor`](https://docs.julialang.org/en/stable/manual/mathematical-operations/#Bitwise-Operators-1) on an array of ints.  Compare this to the [reduce function](https://docs.julialang.org/en/stable/stdlib/collections/#Base.reduce-Tuple{Any,Any,Any}).



# More on Arrays 
## Fusing Dot Operations

Writing explicit for-loops is one way to make code fast.  Last week we saw how to broadcast a function using dot operations.  We can [fuse multiple vectorized functions](https://docs.julialang.org/en/stable/manual/performance-tips/#More-dots:-Fuse-vectorized-operations-1) using the macro `@.`.  This is partly a convenience that lets you avoid writing a `.` after each function, but also insurance to make sure you get the most out of vectorization.

In [38]:
f(x) = 3x.^2 + 4x + 7x.^3
fdot(x) = @. 3x^2 + 4x + 7x^3
;

In [40]:
n = 10^6
x = rand(n)
@time f(x);
@time fdot(x);
@time f.(x);
@time map(f,x)
;

  0.069990 seconds (17 allocations: 45.777 MiB, 31.45% gc time)
  0.020267 seconds (6 allocations: 7.630 MiB)
  0.010630 seconds (8 allocations: 7.630 MiB)
  0.010025 seconds (7 allocations: 7.630 MiB)
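The difference in allocations above comes from temporaries: the unfused version builds an intermediate array for each operation, while the fused version runs a single loop. A minimal sketch follows, with hypothetical function names, including an in-place variant that writes into a preallocated output.

```julia
p_unfused(x) = 3x.^2 + 4x + 7x.^3      # each operation allocates a temporary array
p_fused(x) = @. 3x^2 + 4x + 7x^3       # one fused loop, one output array

# Hypothetical in-place variant: `@. out = ...` expands to `out .= ...`,
# so the result is written into an array that already exists
function p_fused!(out, x)
    @. out = 3x^2 + 4x + 7x^3
    return out
end

x = rand(1000)
out = similar(x)
p_fused!(out, x)
println(p_unfused(x) ≈ p_fused(x))
println(out ≈ p_fused(x))
```

All three compute the same values; they differ only in how many arrays are allocated along the way.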


## Views

Views of arrays access a sub-array in memory (without making a copy).  If you want to perform an operation on a subarray, views can remove the cost of copying an array.  For more information see [here](https://docs.julialang.org/en/stable/manual/performance-tips/#Consider-using-views-for-slices-1).

In [41]:
fcopy(x) = sum(x[2:end-1])
@views fview(x) = sum(x[2:end-1])
;

In [44]:
n = 10^6
x = rand(n)

@time fcopy(x)
@time fview(x)
;

  0.003526 seconds (7 allocations: 7.630 MiB)
  0.000644 seconds (6 allocations: 224 bytes)


(Compare the total allocation sizes: the copying version allocates 7.630 MiB, while the view version allocates only 224 bytes.)
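Because a view shares memory with its parent array, writing through the view changes the parent, whereas writing to a slice does not. A short sketch of that difference:

```julia
x = collect(1.0:5.0)     # [1.0, 2.0, 3.0, 4.0, 5.0]

s = x[2:4]               # slice: copies the three elements
v = @view x[2:4]         # view: refers to the same memory as x

s[1] = 100.0             # changes only the copy
v[1] = -1.0              # changes x[2] as well

println(x)               # [1.0, -1.0, 3.0, 4.0, 5.0]
println(s)               # [100.0, 3.0, 4.0]
```

This is worth keeping in mind when switching existing code from slices to views: any code that relied on the slice being an independent copy will behave differently.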

# Scoping

Another thing to be aware of is how `global` scope in Julia affects your code's performance. While read-only access to `global` variables from a function is allowed, type inference on these does not work: inside a function, Julia treats every non-constant global as having type `Any`. You can get around this issue by using the keyword `const`: this allows Julia to infer the type properly. However, the scoping rules in notebooks are different from those on the REPL and in scripts, so you'll have to take my word for this.
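One way to check this claim without timing anything is to ask what type inference concludes about a function's return type. The sketch below uses `Base.return_types`; the variable and function names are hypothetical, and the exact printed form may differ between Julia versions.

```julia
scale = 2.0            # non-constant global: its type could change later
const SCALE = 2.0      # constant global: the compiler can rely on Float64

times_global(x) = scale * x
times_const(x) = SCALE * x
times_arg(x, s) = s * x        # alternative: pass the value as an argument

# Base.return_types reports what type inference concludes for each method
println(Base.return_types(times_global, (Float64,)))
println(Base.return_types(times_const, (Float64,)))
println(Base.return_types(times_arg, (Float64, Float64)))
```

You should typically see `Any` for the version that reads the non-constant global and `Float64` for the other two. Passing the value as an argument is often the simplest fix when `const` is not an option.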

# Profiling Code

Julia has a [built-in profiler](https://docs.julialang.org/en/stable/manual/profile/#Profiling-1) which will allow you to see where your functions are spending most of their time.  If you have a script or function that is taking a long time to complete, this can help you identify where you should focus your optimization efforts.

In [45]:
using Profile
function test_fn()
    A = randn(1000, 1000)
    b  = randn(1000)
    c = A * b
    maximum(c)
end

test_fn()

111.64010561541913

In [46]:
@profile test_fn()

108.84081323969548

In [47]:
Profile.print()

17 ./task.jl:268; (::getfield(IJulia, Symbol("##15#18...
 17 .../fRegO/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  17 ./essentials.jl:789; invokelatest
   17 ./essentials.jl:790; #invokelatest#1
    16 ...rc/execute_request.jl:67; execute_request(::ZMQ.Socket, ::I...
     16 ...c/SoftGlobalScope.jl:218; softscope_include_string(::Modul...
      16 ./boot.jl:330; eval
       11 ./In[45]:3; test_fn()
        11 ...andom/src/normal.jl:190; randn
         11 ...andom/src/normal.jl:184; randn(::Random.MersenneTwister...
          4 ./boot.jl:419; Type
           4 ./boot.jl:406; Type
          1 ...andom/src/normal.jl:0; randn!
          6 ...andom/src/normal.jl:173; randn!
           1 ...ndom/src/normal.jl:45; randn(::Random.MersenneTwiste...
           4 ...ndom/src/normal.jl:167; randn(::Random.MersenneTwiste...
            1 ./int.jl:51; randn
            1 ./int.jl:438; randn
            1 ...ndom/src/normal.jl:40; randn
             1 ...dom/src/Random.jl:230; rand
              1 

The `@profile` macro runs the expression once and, while it runs, periodically interrupts it to record the call stack (it is a sampling profiler).  The first number in each line is the number of samples in which that line was found on the call stack.  The rest of the line gives you information on the function and where to find it in the code base.  The output is indented based on where in the stack the function was found.

If you want to increase the number of samples, you can put your function in a for-loop as follows:

In [51]:
Profile.clear()

In [52]:
@profile for i = 1:100 test_fn() end

In [53]:
Profile.print()

923 ./task.jl:268; (::getfield(IJulia, Symbol("##15#1...
 923 ...fRegO/src/eventloop.jl:8; eventloop(::ZMQ.Socket)
  923 ./essentials.jl:789; invokelatest
   923 ./essentials.jl:790; #invokelatest#1
    923 ...rc/execute_request.jl:67; execute_request(::ZMQ.Socket, ::I...
     923 ...c/SoftGlobalScope.jl:218; softscope_include_string(::Modu...
      923 ./boot.jl:330; eval
       923 ./In[52]:1; top-level scope
        923 ...file/src/Profile.jl:25; macro expansion
         922 ./In[52]:1; macro expansion
          881 ./In[45]:3; test_fn()
           881 ...dom/src/normal.jl:190; randn
            881 ...dom/src/normal.jl:184; randn(::Random.MersenneTwis...
             25  ./boot.jl:419; Type
              25 ./boot.jl:406; Type
             28  ...dom/src/normal.jl:0; randn!
             828 ...dom/src/normal.jl:173; randn!
              95  ./array.jl:766; setindex!
              26  ...om/src/normal.jl:43; randn(::Random.MersenneTwi...
              3   ...om/src/normal.jl:44; ran
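The tree output can be hard to scan when most of it is notebook plumbing. As a sketch, `Profile.print` also accepts keyword arguments such as `format`, `sortedby`, and `mincount` to produce a flat listing that hides rarely sampled lines. The function `work` below is a hypothetical stand-in for `test_fn`, and the sample counts you get will differ from run to run.

```julia
using Profile

function work()                 # hypothetical function to profile
    A = randn(300, 300)
    b = randn(300)
    return maximum(A * b)
end

work()                          # run once so compilation is not profiled
Profile.clear()                 # discard samples from earlier runs
@profile for _ in 1:50
    work()
end

# Flat listing, sorted by sample count, hiding lines seen fewer than 5 times
Profile.print(format = :flat, sortedby = :count, mincount = 5)
```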

If you'd like to go beyond the built-in profiler, there's a package that will graphically interpret the results called [ProfileView](https://github.com/timholy/ProfileView.jl).  You can [track memory allocation](https://docs.julialang.org/en/stable/manual/profile/#Memory-allocation-analysis-1) for each line of code by starting up Julia with `--track-allocation=<setting>`.

## Exercise 2

* Use views and broadcasting to implement the [xor swap algorithm](https://en.wikipedia.org/wiki/XOR_swap_algorithm) on `X[inds]`, `Y[inds]`, where `X` and `Y` are arrays of the same type and size, and `inds` is a common subarray block. 
* Use Julia's profiler on the package of your choice.  What's taking the most time?  If you want a starting point, try PyCall `pycall("script.py")` or `py"...code here..."`.
* (if you have time) Try profiling your fast matrix types from HW 2.  Where would you focus your efforts if you wanted increased speed?

## More speed

### [Type Definitions](https://docs.julialang.org/en/stable/manual/performance-tips/#Type-declarations-1)

When you declare types, you should (whenever possible) give fields a concrete type; even a narrow abstract type such as `Real` is not enough.  If you want to allow for multiple types in the field, parameterize your type.  If there is any ambiguity in what the actual instantiated type will be, the compiler will not be able to allocate space correctly, and will generally miss out on optimizations.

For more about type stability, check out the [`@code_warntype` macro](https://docs.julialang.org/en/stable/manual/performance-tips/#man-code-warntype-1).


In [54]:
function swapsub!(X::Array{T}, Y::Array{T}, inds) where T
    @views @. X[inds] = xor(X[inds], Y[inds])
    @views @. Y[inds] = xor(X[inds], Y[inds])
    @views @. X[inds] = xor(X[inds], Y[inds])
end

swapsub! (generic function with 1 method)

In [55]:
x = [1; 2; 3]
y = [4; 5; 6]
@time swapsub!(x, y, 1:2)
@show x
@show y
;

  0.360861 seconds (505.98 k allocations: 23.543 MiB, 9.42% gc time)
x = [4, 5, 3]
y = [1, 2, 6]


In [56]:
mutable struct AmbiguousType
    x
end

mutable struct StillAmbiguousType
    x::Real
end

mutable struct NonAmbiguousType
    x::Float64 
end

In [58]:
n = 2000
for T in (AmbiguousType, StillAmbiguousType, NonAmbiguousType)
    @time a = Array{T}(undef,n)
    t1 = @elapsed for i=1:n
        a[i] = T(randn())
    end
    println("$t1 seconds to fill array")
    s = T(0)
    t2 = @elapsed for i=1:n
        s.x += a[i].x
    end
    println("$t2 seconds to sum array")
end

  0.000019 seconds (8 allocations: 16.203 KiB)
0.000652452 seconds to fill array
0.000560974 seconds to sum array
  0.000008 seconds (8 allocations: 16.203 KiB)
0.001000907 seconds to fill array
0.00063341 seconds to sum array
  0.000011 seconds (8 allocations: 16.203 KiB)
0.000769774 seconds to fill array
0.000634539 seconds to sum array
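The advice to "parameterize your type" is not shown in the structs above, so here is a minimal sketch of it. The struct names `RealField` and `ParamField` are hypothetical; `fieldtype` and `isconcretetype` are used to check whether the compiler knows the exact field type.

```julia
struct RealField            # field has an abstract type
    x::Real
end

struct ParamField{T<:Real}  # field type is a parameter, fixed per instance type
    x::T
end

println(isconcretetype(fieldtype(RealField, :x)))             # false
println(isconcretetype(fieldtype(ParamField{Float64}, :x)))   # true

# The parameter is inferred from the constructor argument
println(typeof(ParamField(1.5)) == ParamField{Float64})       # true
println(typeof(ParamField(2)) == ParamField{Int})             # true
```

A `Vector{ParamField{Float64}}` can therefore store its elements inline, much like a `Vector` of `NonAmbiguousType`-style structs, while the same definition still works for other real number types.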


## Array Declaration

When you use arrays, you should pre-allocate if possible.  Specific type information is also valuable. However, a vector with an abstract element type (e.g. `Real[]`) is actually no better for performance than a vector of `Any` (`Any[]`); it is simply an organizational tool.

In [60]:
n = 10000
@time a1 = Real[] # Abstract type
@time a2 = Float64[] # specific type
@time a3 = Array{Real}(undef,n) # pre-allocated abstract type
@time a4 = Array{Float64}(undef,n) # pre-allocated specific type
;

  0.000006 seconds (5 allocations: 240 bytes)
  0.000007 seconds (5 allocations: 240 bytes)
  0.000061 seconds (13 allocations: 78.813 KiB)
  0.000023 seconds (13 allocations: 78.813 KiB)


In [61]:
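function fill_push(a)
    # Note: `n` is read from global scope here; a non-constant global is slow to access.
    for i = 1:n
        push!(a,rand())    # grows the array one element at a time
    end
    return a
end


function fill_inbounds(a)
    # Assumes `a` already has at least `n` elements, since bounds checks are skipped.
    for i = 1:n
        @inbounds a[i] = rand()
    end
end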

fill_inbounds (generic function with 1 method)

In [63]:
@time fill_inbounds(a3)

  0.001849 seconds (38.98 k allocations: 765.469 KiB)


In [65]:
@time fill_inbounds(a4)

  0.001846 seconds (28.98 k allocations: 609.219 KiB)


In [67]:
@time fill_push(a1);

  0.002826 seconds (38.98 k allocations: 1021.531 KiB)
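As a sketch of the two approaches with the length passed as an argument instead of read from a global, here are two hypothetical functions that build the same array. `sizehint!` is a middle ground: it reserves capacity up front so that `push!` rarely needs to reallocate.

```julia
function squares_prealloc(n)
    a = Vector{Float64}(undef, n)   # concrete element type, allocated once
    for i in eachindex(a)
        a[i] = i^2
    end
    return a
end

function squares_push(n)
    a = Float64[]
    sizehint!(a, n)                 # reserve capacity for n elements
    for i in 1:n
        push!(a, i^2)
    end
    return a
end

println(squares_prealloc(3))                          # [1.0, 4.0, 9.0]
println(squares_prealloc(1000) == squares_push(1000)) # true
println(isconcretetype(eltype(Real[])))               # false
println(isconcretetype(eltype(Float64[])))            # true
```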


### Subnormal Numbers

You can treat subnormal numbers as zero.  If a nonzero number is smaller in magnitude than the smallest normal floating-point number, your computer may still represent it as a subnormal (denormal) number, and operations on such numbers can incur performance penalties.  Flushing them to zero avoids the penalty, but representing subnormals is required by the IEEE standard, so be careful.  See [Denormal Numbers](https://en.wikipedia.org/wiki/Denormal_number) on Wikipedia for more info. The following example is from [Julia's documentation](https://docs.julialang.org/en/stable/manual/performance-tips/#treat-subnormal-numbers-as-zeros), and models the heat equation.

In [69]:
function timestep(b::Vector{T}, a::Vector{T}, Δt::T ) where T
    @assert length(a)==length(b)
    n = length(b)
    b[1] = 1                            # Boundary condition
    for i=2:n-1
        b[i] = a[i] + (a[i-1] - T(2)*a[i] + a[i+1]) * Δt
    end
    b[n] = 0                            # Boundary condition
end

function heatflow( a::Vector{T}, nstep::Integer ) where T
    b = similar(a)
    for t=1:div(nstep,2)                # Assume nstep is even
        timestep(b,a,T(0.1))
        timestep(a,b,T(0.1))
    end
end

heatflow(zeros(Float32,10),2)           # Force compilation
for trial=1:6
    a = zeros(Float32,1000)
    set_zero_subnormals(iseven(trial))  # Odd trials use strict IEEE arithmetic
    @time heatflow(a,1000)
end

  0.005074 seconds (1 allocation: 4.063 KiB)
  0.002675 seconds (1 allocation: 4.063 KiB)
  0.003264 seconds (1 allocation: 4.063 KiB)
  0.003287 seconds (1 allocation: 4.063 KiB)
  0.005258 seconds (1 allocation: 4.063 KiB)
  0.002701 seconds (1 allocation: 4.063 KiB)
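To see what a subnormal number is, you can inspect one directly with Base's `floatmin` and `issubnormal`. This sketch only looks at values; it does not change the subnormal setting.

```julia
tiny = floatmin(Float32)     # smallest positive *normal* Float32
sub = tiny / 2               # below the normal range: stored as a subnormal

println(issubnormal(tiny))   # false
println(issubnormal(sub))    # true
println(sub > 0)             # true: still distinguishable from zero

# Reports whether subnormals are currently flushed to zero on this thread
println(get_zero_subnormals())   # false unless it was enabled earlier
```

In the heat-flow example, values decaying toward zero pass through this range, which is presumably why the even-numbered trials (with `set_zero_subnormals(true)`) tend to run faster.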


### [Access arrays in column-major order](https://docs.julialang.org/en/stable/manual/performance-tips/#Access-arrays-in-memory-order,-along-columns-1)
If you need to loop over an array, keep in mind that it is stored in column-major format, so the first index should vary fastest: put the loop over the first index innermost and the loop over the last index outermost.  This lets you use contiguous blocks of memory more efficiently.

In [70]:
# access in Column-major order
function sum_array1(A::Array{T,3}) where T
    s::T = 0
    @simd for k=1:size(A,3)
        @simd for j=1:size(A,2)
            @simd for i=1:size(A,1)
                @inbounds s += A[i,j,k]
            end
        end
    end
    return s
end

# access in Row-major order
function sum_array2(A::Array{T,3}) where T
    s::T = 0
    @simd for i=1:size(A,1)
        @simd for j=1:size(A,2)
            @simd for k=1:size(A,3)
                @inbounds s += A[i,j,k]
            end
        end
    end
    return s
end
;

In [72]:
n = 300
A = rand(Int64,n,n,n)
@time sum_array1(A)
@time sum_array2(A)
;

  0.017753 seconds (5 allocations: 176 bytes)
  0.513237 seconds (5 allocations: 176 bytes)


In [73]:
A = [1 2; 3 4]
@show A
@show A[:]
;

A = [1 2; 3 4]
A[:] = [1, 3, 2, 4]
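The same layout rule extends to higher dimensions. As a small sketch, `strides` reports how far apart in memory neighboring elements are along each dimension; the helper `lin` is a hypothetical name that spells out the linear index formula for a 2×3×4 array.

```julia
A = reshape(collect(1:24), 2, 3, 4)   # each element's value equals its linear index

println(strides(A))                   # (1, 2, 6): the first index is contiguous

# Linear index of A[i, j, k] in a 2x3x4 column-major array
lin(i, j, k) = i + (j - 1) * 2 + (k - 1) * 6

println(A[2, 3, 4] == lin(2, 3, 4))   # true: both are 24
```

Stepping along the first index moves one element in memory, while stepping along the last index jumps six elements, which is why `sum_array1` is so much faster than `sum_array2`.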


### [Minor Tweaks](https://docs.julialang.org/en/stable/manual/performance-tips/#Tweaks-1)

Julia's performance documentation suggests the following optimizations for making very fast inner loops:

* Avoid unnecessary arrays. For example, instead of `sum([x,y,z])` use `x+y+z`.
* Use `abs2(z)` instead of `abs(z)^2` for complex z. In general, try to rewrite code to use `abs2()` instead of `abs()` for complex arguments. (This would be useful for writing fast Julia Set codes!)
* Use `div(x,y)` for truncating division of integers instead of `trunc(x/y)`, `fld(x,y)` instead of `floor(x/y)`, and `cld(x,y)` instead of `ceil(x/y)`.
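A few quick comparisons show what these replacements return. Besides avoiding a square root or a floating-point division, the integer versions keep the result an integer.

```julia
z = 3 + 4im
println(abs2(z))        # 25    (no square root taken)
println(abs(z)^2)       # 25.0  (computes the magnitude first, then squares it)

println(div(7, 2))      # 3
println(trunc(7 / 2))   # 3.0   (goes through floating-point division)

println(fld(-7, 2))     # -4
println(floor(-7 / 2))  # -4.0

println(cld(7, 2))      # 4
println(ceil(7 / 2))    # 4.0
```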

## Exercise 3

* Write a standard matrix-vector multiplication function on two arrays.  Can you get close to the default implementation's performance (BLAS gemv)?
* How would you modify your routine to do a matrix transpose-vector multiplication routine? 
* Why do you think [BLAS's gemv](http://www.netlib.org/lapack/explore-html/dc/da8/dgemv_8f.html) takes the arguments that it does?

## Additional Performance Analysis

* [Lint](https://github.com/tonyhffong/Lint.jl) - analyze code for potential improvements