# Lecture 1: Boxes and Registers

This is an [IJulia notebook](https://github.com/JuliaLang/IJulia.jl) (the [Jupyter](http://jupyter.org/)-based front-end for Julia) for [18.S096 at MIT in IAP 2017](https://math.mit.edu/classes/18.S096/iap17/), designed to accompany the first lecture.

The basic goal of this lecture is to understand why some code (in some languages and/or styles of coding) is slow while other code is fast, based on whether it can be compiled to efficiently use the CPU registers and low-level arithmetic instructions, or whether it relies on "boxed" types that force "dynamic" computations.

To illustrate this, we will implement a **sum** function `sum(a)`, which computes

$$
\mathrm{sum}(a) = \sum_{i=1}^n a_i
$$

for an array `a` with `n` elements.   We will use the built-in `sum` functions in Julia and Python along with hand-coded implementations in C, Python, and Julia.

We will use some tricks so that we can write and benchmark C, Python, and Julia code all in the same notebook.  In the case of Python, this will rely on the [PyCall](https://github.com/JuliaPy/PyCall.jl) package to call Python from Julia.  We will use the [BenchmarkTools](https://github.com/JuliaCI/BenchmarkTools.jl) Julia package to collect benchmarking statistics for us.

## Low-level C code

To start with, we will write a baseline implementation in the low-level C programming language.  Our C function `c_sum` will only work for a single data type: an array `X` of double-precision floating-point values (`double` in C, or `Float64` in Julia).

(In contrast, our Julia code, and some of our Python code, will work for any numeric type; we'll see whether we pay a price for this.)

Julia can easily call C functions in shared libraries via its `ccall` syntax.  So, we'll take our C routine (in a string) and pipe it through the C compiler `gcc` to produce a shared library file that we can load and call.

In [1]:
C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""
# compile to a shared library by piping C_code to gcc:
# (only works if you have gcc installed)
const Clib = tempname()
open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code)
end
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)

c_sum (generic function with 1 method)

Of course, we should first check whether our function is correct, by comparing it to Julia's built-in `sum` function on an array of $10^7$ random numbers.  Different floating-point algorithms for the `sum` function give slightly different results (Julia's `sum` algorithm is actually *much more accurate* than the one here, but that's a story for another day), so we'll compute their "fractional difference" or "relative error" and make sure that this is small.  

(Double-precision floating-point arithmetic keeps about 15 decimal digits, so any relative error close to $10^{-15}$ is a reasonable amount of roundoff error.)

In [2]:
# define a function to compute the relative (fractional) error |x-y| / mean(|x|,|y|)
relerr(x,y) = abs(x - y) * 2 / (abs(x) + abs(y))

a = rand(10^7) # array of random numbers in [0,1)
relerr(c_sum(a), sum(a))

1.7358575333763837e-13
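
To see where such tiny discrepancies come from, here is a minimal illustration (not one of the lecture's benchmarks). Floating-point addition is not associative, so adding the same numbers in a different order can change the last bits of the result. `relerr` is repeated from the cell above so that the snippet runs on its own, and the three numbers are arbitrary.

```julia
# Relative error helper, repeated from the cell above.
relerr(x, y) = abs(x - y) * 2 / (abs(x) + abs(y))

# Machine epsilon: the gap between 1.0 and the next larger Float64.
println(eps(Float64))                 # 2.220446049250313e-16

# The same three numbers, added in two different orders.
x = (0.1 + 0.2) + 0.3
y = 0.1 + (0.2 + 0.3)
println(x == y)                       # false
println(relerr(x, y) < 4 * eps())     # true: they differ only at roundoff level
```

With $10^7$ terms there are many more rounding steps, so it is plausible for two summation orders to disagree by somewhat more than a single roundoff, as in the output above.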

Collecting accurate benchmarking statistics can be a tricky business, so we'll use the Julia `BenchmarkTools` package to do most of the work.   If you don't have it installed, you may need to type `Pkg.add("BenchmarkTools")` to tell Julia to download and install it.

It defines a *macro* `@benchmark` that takes some Julia code and *transforms* it into a benchmark measuring the speed of that code.   We pass the argument `a` of the `c_sum` function to be benchmarked by the special syntax `$a` for technical reasons, basically to make sure that Julia's analysis of the variable `a` happens *before* the benchmark starts.  Macro syntax and **metaprogramming** will be a topic for another lecture.

In [3]:
using BenchmarkTools

c_bench = @benchmark c_sum($a)

BenchmarkTools.Trial: 
  samples:          520
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     8.14 ms (0.00% GC)
  median time:      9.10 ms (0.00% GC)
  mean time:        9.61 ms (0.00% GC)
  maximum time:     14.58 ms (0.00% GC)

Typically, the most useful number to look at in a benchmark is the **minimum time**.  Basically, your computer is doing lots of things all of the time, and creates random interruptions that can cause spikes in the timing that we want to ignore.

Here, the minimum time is around **8 ms** for summing $10^7$ numbers, or about **1 billion additions per second**.  That sounds like a lot, but on my **2.5GHz laptop** it is well below the peak rate at which the computer can perform arithmetic.   It doesn't reach the peak arithmetic rate because for each floating-point addition, the processor needs to perform several additional calculations to load the next element of the array from memory, not to mention the time for the memory access itself.  Of course, you may get a slightly different number if you run this benchmark on a different computer.

This **8 ms** number for type-specific compiled C code is a baseline against which we will compare our other implementations of summation, below.
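
The throughput figure quoted above can be checked with a few lines of arithmetic. The inputs are the minimum time from the benchmark output and the 2.5 GHz clock rate mentioned in the text. Treating the clock as running at exactly its nominal rate is a simplifying assumption.

```julia
n     = 10^7       # number of elements summed (one addition each)
t_min = 8.14e-3    # minimum time in seconds, from the c_sum benchmark above
clock = 2.5e9      # nominal clock rate in Hz (the 2.5 GHz laptop from the text)

rate = n / t_min           # additions per second
println(rate / 1e9)        # about 1.23 (billion additions per second)
println(clock / rate)      # about 2.0 (clock cycles spent per addition)
```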

## Python sum functions

Now, we'll call the Python functions and benchmark them.   The PyCall package allows us to load Python as a library and call it directly from Julia, sharing memory with Python and passing data and functions back and forth.  There is very little overhead to this, and in any case we will be summing $10^7$ numbers, so the overhead of the Julia/Python interface is negligible compared to the cost of the summation itself.

In [4]:
using PyCall
PyCall.pyversion

v"2.7.12"

### Built-in `sum` of a Python `list`

To start with, I will convert our array `a` into a Python `list` (the built-in Python array-like data structure), and sum it with the built-in Python `sum` function.

In [5]:
# call a low-level PyCall function to get a Python list, because
# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):
apy_list = PyCall.array2py(a, 1, 1)
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")

PyObject <built-in function sum>

Let's check that we can call it, and that it computes the same answer as the Julia `sum`:

In [6]:
relerr(pysum(apy_list), sum(a))

1.7358575333763837e-13

Now, we'll benchmark it:

In [7]:
py_list_bench = @benchmark $pysum($apy_list)

BenchmarkTools.Trial: 
  samples:          56
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  672.00 bytes
  allocs estimate:  19
  minimum time:     73.23 ms (0.00% GC)
  median time:      91.34 ms (0.00% GC)
  mean time:        90.37 ms (0.00% GC)
  maximum time:     104.63 ms (0.00% GC)

It takes about **73 ms**, which is roughly **9× slower** than the C routine above.  This is true **even though the Python `sum` function** is [written in almost 200 lines of C code](https://github.com/python/cpython/blob/b5be30c92ff8cc0ac83f48014fe78bc048141021/Python/bltinmodule.c#L2191-L2365).   The problem is that the Python code *pays a price for being generic*: it handles arbitrary iterable data structures whose elements are arbitrary "boxes" (`PyObject` pointers), and it has to perform many extra computations both to fetch each item and to perform each addition.

### NumPy `sum` of a NumPy `array`

You can do *much better* if you can take advantage of the fact that *all of the elements are the same type*.  Then, you can store the array as the actual floating-point data laid out consecutively in memory (not an array of pointers to boxes), and your inner loop can be fast because the type checks can occur *outside* the loop. In Python, this kind of **homogeneous array** is exactly what is provided by [NumPy](http://www.numpy.org/). Internally, a NumPy array is essentially just a wrapper around a C-like `double*` array.   NumPy also provides a `numpy.sum` function that can sum a NumPy array quickly.

There is a catch, though: NumPy itself is written mostly in C, not Python.  And because C code is not type-generic, in order to handle a wide variety of NumPy array types (integers, double precision, single precision, etcetera), NumPy uses rather tricky **auto-generated C code**.  And even then it can only handle a small set of commonly used types; you can't define your own types and sum them quickly.

Anyway, we can easily convert a Julia array to a NumPy array with PyCall (in fact, PyCall does this by default), and benchmark `numpy.sum`:

In [8]:
numpy_sum = pyimport("numpy")["sum"]
apy_numpy = PyObject(a) # converts to a numpy array by default
py_numpy_bench = @benchmark $numpy_sum($apy_numpy)

BenchmarkTools.Trial: 
  samples:          1150
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  960.00 bytes
  allocs estimate:  25
  minimum time:     3.94 ms (0.00% GC)
  median time:      4.29 ms (0.00% GC)
  mean time:        4.35 ms (0.00% GC)
  maximum time:     7.80 ms (0.00% GC)

WOW, it is actually **roughly twice as fast** as our C function!

The reason for this extra boost is that the NumPy functions exploit [SIMD instructions](https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions): special CPU instructions that can perform multiple additions at once, which we didn't use in our C code above.

### Hand-written Python `sum` function

To complete the story, let's write our own `mysum` function in Python that sums an arbitrary Python list (or array, or any iterable Python container).

Of course, you would never do this for summation — you would always use one of the built-in
functions in practice.  But someday, you will inevitably run into a problem where the
performance-critical code has not already been written for you, and you will need to write
your own.  So it is a good exercise to see how easy it is to get performance that
is comparable to the library routines.

In [9]:
# It currently takes a little bit of hackery to define a custom Python function
# in a Julia string and call it via PyCall, sorry:
syms = PyDict{AbstractString, PyObject}()
syms["syms"] = PyObject(Any[])
pyeval("""
def mysum(a):
    s = 0.0
    for x in a:
        s = s + x
    return s

syms.insert(0, mysum)
""", PyAny, syms, PyCall.Py_file_input)
mysum_py = syms["syms"][1] # a reference to the Python mysum function

PyObject <function mysum at 0x336a197d0>

As usual, let's check that it works, first:

In [10]:
relerr(mysum_py(apy_list), sum(a))

1.7358575333763837e-13

Now, let's time it on our Python list:

In [11]:
@benchmark $mysum_py($apy_list)

BenchmarkTools.Trial: 
  samples:          4
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  672.00 bytes
  allocs estimate:  19
  minimum time:     1.30 s (0.00% GC)
  median time:      1.32 s (0.00% GC)
  mean time:        1.33 s (0.00% GC)
  maximum time:     1.36 s (0.00% GC)

Yikes, **1.3 seconds**.  That's almost **20× slower than the Python `sum`** and about **160× slower than our C code**.

Using our `mysum` function with the NumPy array is no better, and in fact is a bit worse:

In [12]:
@benchmark $mysum_py($apy_numpy)

BenchmarkTools.Trial: 
  samples:          3
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  960.00 bytes
  allocs estimate:  25
  minimum time:     1.82 s (0.00% GC)
  median time:      1.83 s (0.00% GC)
  mean time:        1.85 s (0.00% GC)
  maximum time:     1.88 s (0.00% GC)

You can't take advantage of the NumPy array format in Python itself — you still have to write the performance-critical code in C (or hope someone else has written it for you).

## Built-in Julia `sum` function

Now, let's try the same thing in Julia, starting with the built-in Julia `sum` function operating on our array `a`:

In [13]:
j_bench = @benchmark sum($a)

BenchmarkTools.Trial: 
  samples:          1150
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     3.74 ms (0.00% GC)
  median time:      4.10 ms (0.00% GC)
  mean time:        4.35 ms (0.00% GC)
  maximum time:     10.03 ms (0.00% GC)

Hooray **3.7 ms**.  That's more than **2× the speed of the C code**, and slightly faster than even `numpy.sum`.  Again, you can guess that it must be using SIMD to beat the C code.  And, again, it must be taking advantage of the fact that the array is homogeneous.

The type of `a` is:

In [14]:
typeof(a)

Array{Float64,1}

This is the Julia type for a 1-dimensional array of `Float64` values (64-bit "double" precision floating-point numbers, equivalent to C `double`).  Because the type of the elements is "attached" to the type of the array (a "parameterized" type, more on this later), Julia is able to store it as a "flat" array of consecutive `Float64` values in memory.

In contrast, the Julia equivalent of a Python `list` is a `Vector{Any}` (a synonym for `Array{Any,1}`): internally, this is an array of pointers to "boxes" that can hold any type (`Any`).  This makes things *much* slower: each `+` computation on an `Any` value must dynamically look up the type of object, figure out what `+` function to call, and allocate a new "box" to store the result:

In [15]:
a_any = Vector{Any}(a)
j_bench_any = @benchmark sum($a_any)

BenchmarkTools.Trial: 
  samples:          24
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  153.09 mb
  allocs estimate:  10032767
  minimum time:     201.18 ms (1.76% GC)
  median time:      209.44 ms (1.83% GC)
  mean time:        208.61 ms (1.86% GC)
  maximum time:     224.10 ms (1.95% GC)

This is about **200 ms**, or a **50× slowdown**.  In fact, it is more than **2× slower than the Python `sum(list)` code**.  The Python `sum` function is better optimized than the Julia `sum` function for dealing with untyped (`Any`) values, in part because in Julia it is expected that you will use "concretely" typed arrays in all performance-critical cases.

Unlike NumPy, however, Julia allows you to make efficient homogeneous arrays for any data type, even data types you define yourself, and you can operate on them efficiently with code written in Julia itself.
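
As a small sketch of the difference between the two kinds of arrays (the variable names `v_flat` and `v_any` are made up for this illustration), the snippet below builds one concretely typed array and one `Any` array and inspects them. It only uses functions from Julia's `Base`.

```julia
v_flat = Float64[1, 2.5, 3]       # homogeneous: every element is converted to Float64
v_any  = Any[1, 2.5, "three"]     # heterogeneous: each element is a separate boxed value

println(eltype(v_flat))           # Float64
println(eltype(v_any))            # Any
println(isbits(v_flat[1]))        # true: a Float64 is plain data that fits in a register

# A concretely typed array refuses values that cannot be converted
# to its element type:
try
    push!(v_flat, "four")
catch err
    println(typeof(err))          # MethodError
end
```

Because the element type is part of the array's type, code that loops over `v_flat` can be compiled knowing that every element is a `Float64`, whereas code that loops over `v_any` has to check each element at runtime.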

## Hand-written Julia `sum` functions

Let's try to write our own `sum` function in Julia, just as we wrote our own Python function.  We'll implement four different versions and see how they compare.  We'll start simple:

In [16]:
function mysum1(A)
    s = 0
    for a in A
        s += a
    end
    return s
end
relerr(mysum1(a), sum(a))

1.7358575333763837e-13

In [17]:
j1_bench = @benchmark mysum1($a)

BenchmarkTools.Trial: 
  samples:          24
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  457.76 mb
  allocs estimate:  30000000
  minimum time:     203.60 ms (3.78% GC)
  median time:      208.03 ms (3.85% GC)
  mean time:        208.85 ms (3.85% GC)
  maximum time:     215.05 ms (3.72% GC)

Yikes!  The performance is **terrible**, more than **200ms**.

However, it turns out we made a simple mistake: we had a **type instability** in our code, because the summation variable `s` **changes type** in our function.  It starts out as an *integer* `0` (a Julia `Int`), but once we do `s += a` (shorthand for `s = s + a`), the result *changes* to a floating-point value.  Julia's compiler is quite simple-minded about such things: **once it sees that a variable changes type, it assumes it can be any type** and stores it in a "box".

One symptom of this is the "allocs estimate: 30000000" line above: it is doing *zillions of allocations* in order to allocate all the "boxes" for the untyped `s` values.  Another way of seeing this would be to use the `@code_warntype` macro provided with Julia:

In [18]:
@code_warntype mysum1(a)

Variables:
  #self#::#mysum1
  A::Array{Float64,1}
  s::Any
  #temp#@_4::Int64
  a::Float64
  #temp#@_6::LambdaInfo
  #temp#@_7::Float64

Body:
  begin 
      s::Any = 0 # line 3:
      #temp#@_4::Int64 = $(QuoteNode(1))
      4: 
      unless (Base.box)(Base.Bool,(Base.not_int)((#temp#@_4::Int64 === (Base.box)(Int64,(Base.add_int)((Base.arraylen)(A::Array{Float64,1})::Int64,1)))::Bool)) goto 29
      SSAValue(2) = (Base.arrayref)(A::Array{Float64,1},#temp#@_4::Int64)::Float64
      SSAValue(3) = (Base.box)(Int64,(Base.add_int)(#temp#@_4::Int64,1))
      a::Float64 = SSAValue(2)
      #temp#@_4::Int64 = SSAValue(3) # line 4:
      unless (Core.isa)(s::Union{Float64,Int64},Float64)::Any goto 14
      #temp#@_6::LambdaInfo = LambdaInfo for +(::Float64, ::Float64)
      goto 23
      14: 
      unless (Core.isa)(s::Union{Float64,Int64},Int64)::Any goto 18
      #temp#@_6::LambdaInfo = LambdaInfo for +(::Int64, :

The tip-off here is the `s::Any` lines, telling you that `s` is stored in a "box" that holds type `Any`.  In computer-science lingo, we would say that **compiler type inference has failed** to determine the type of `s`.   More on this below.

One solution would be to initialize `s = 0.0`, i.e. start `s` out as a floating-point value.  This would work for our floating-point array `a`, but then wouldn't work for other types of arrays.  Instead, Julia provides an `eltype(A)` function to fetch the *type of the elements* of `A`, and a `zero` function to *initialize `s` to the correct type of zero* for `A`:

In [19]:
function mysum2(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    return s
end
relerr(mysum2(a), sum(a))

1.7358575333763837e-13

In [20]:
j2_bench = @benchmark mysum2($a)

BenchmarkTools.Trial: 
  samples:          548
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     8.14 ms (0.00% GC)
  median time:      8.81 ms (0.00% GC)
  mean time:        9.13 ms (0.00% GC)
  maximum time:     13.18 ms (0.00% GC)

Now that's more like it!  It runs in **8 ms**, essentially the **same speed as the hand-written C code**.

*Unlike* the C code, however, it works for *any* type of array, and in fact just about any type of "iterable container" (as long as it provides an `eltype` method), and can **sum any type of value** (as long as `zero` and `+` are defined), including user-defined types.  (We'll give an example below).
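
To make the genericity concrete, here is a short sketch that evaluates `zero(eltype(A))` for a few built-in container types and then applies the same `mysum2` to them. The function is repeated so that the snippet runs on its own, and the sample containers are arbitrary choices.

```julia
# zero(eltype(A)) yields "the right kind of zero" for each container:
println(zero(eltype([1, 2, 3])))              # 0
println(zero(eltype([1.5, 2.5])))             # 0.0
println(zero(eltype([1 + 2im, 3 - 1im])))     # 0 + 0im
println(zero(eltype(1:10)))                   # 0  (a range is an iterable container too)

# The same function as above, repeated here for self-containment.
function mysum2(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    return s
end

println(mysum2([1, 2, 3]))            # 6
println(mysum2(1:10))                 # 55
println(mysum2([1 + 2im, 3 - 1im]))   # 4 + 1im
println(mysum2([1//2, 1//3]))         # 5//6
```

In each case `s` starts out with the element type and keeps that type throughout the loop, so the loop is type-stable for every one of these inputs.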

The performance does not quite match the Julia built-in `sum` function or the `numpy.sum` function, however.  Our guess above was that they were exploiting SIMD optimizations.  However, we can do that too, in our own Julia code, by using the `@simd` macro to tell Julia's compiler to turn on SIMD optimizations for that loop.

(SIMD optimizations are not turned on by default, because they only speed up very particular kinds of code, and turning them on everywhere would slow down the compiler too much.)
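
One side effect of SIMD summation is worth knowing about: a vectorized sum keeps several partial sums at once, so it adds the elements in a different order than a plain left-to-right loop. The sketch below imitates this by hand with four partial sums. The function names `sum_sequential` and `sum_lanes4` are hypothetical, and this is only an analogy for the reordering, not a description of the code the compiler actually generates.

```julia
# Plain left-to-right summation.
function sum_sequential(A)
    s = zero(eltype(A))
    for x in A
        s += x
    end
    return s
end

# Hypothetical 4-lane sum: four independent partial sums, combined at the end.
function sum_lanes4(A)
    T = eltype(A)
    s1 = zero(T); s2 = zero(T); s3 = zero(T); s4 = zero(T)
    n = length(A)
    i = 1
    while i + 3 <= n
        s1 += A[i]; s2 += A[i+1]; s3 += A[i+2]; s4 += A[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    while i <= n               # leftover elements when n is not a multiple of 4
        s += A[i]
        i += 1
    end
    return s
end

A = [1 / k for k in 1:10^5]    # deterministic floating-point test data
s_seq = sum_sequential(A)
s_lan = sum_lanes4(A)
println(abs(s_seq - s_lan) / abs(s_seq))   # tiny, but not necessarily zero

# Integer addition is exact (barring overflow), so the order does not matter:
println(sum_sequential(collect(1:10)) == sum_lanes4(collect(1:10)))   # true
```

Because floating-point addition is not associative, the two orders can round differently, so a SIMD sum and a sequential sum may agree only to roundoff accuracy rather than bit for bit.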

In [21]:
function mysum3(A)
    s = zero(eltype(A))
    @simd for a in A
        s += a
    end
    return s
end
relerr(mysum3(a), sum(a))

1.1361299306432493e-14

In [22]:
j3_bench = @benchmark mysum3($a)

BenchmarkTools.Trial: 
  samples:          1166
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     3.68 ms (0.00% GC)
  median time:      4.11 ms (0.00% GC)
  mean time:        4.29 ms (0.00% GC)
  maximum time:     7.81 ms (0.00% GC)

Hooray!  **3.7 ms**, basically the same speed as Julia's built-in `sum` function and `numpy.sum`!   And it only required **7 lines of code**, some care with types, and a very minor bit of wizardry with the `@simd` tag to get the last factor of two.

Moreover, the code is still **type generic**: it can sum any container of any type that works with addition.   And we **didn't have to declare any types** of any arguments or variables; the compiler figured everything out.  How?

## Type inference and specialization

To go any further, you need to understand something very basic about how Julia works.  Suppose we define a very simple function:

In [23]:
f(x) = x + 1

f (generic function with 1 method)

We didn't declare the type of `x`, and so our function `f(x)` will work with **any type of `x`** (as long as the `+ 1` operation is defined for that type):

In [24]:
f(3) # x is an integer (technically, a 64-bit integer)

4

In [25]:
f(3.1) # x is a floating-point value (Float64)

4.1

In [26]:
f([1,2,3]) # x is an array of integers

3-element Array{Int64,1}:
 2
 3
 4

How can a function like `f(x)` work for any type?  In Python, `x` would be a "box" that could contain anything, and it would then look up *at runtime* how to compute `x + 1`.  But we just saw that untyped Julia `sum` code could be fast.

The secret is **just-in-time (JIT) compilation**.   The first time you call `f(x)` **with a new type of argument** `x`, Julia will **compile a new version of `f` specialized for that type**.  The *next* time you call `f(x)` with the same argument type, it re-uses the previously compiled code.

So, right now, after evaluating the above code, we have *three* versions of `f` compiled and sitting in memory: one for `x` of type `Int` (we say `x::Int` in Julia), one for `x::Float64`, and one for `x::Vector{Int}`.
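
One visible consequence of specialization is that the result type follows from the argument type. The sketch below checks this for a few argument types and also asks the compiler what return type it infers. `Base.return_types` is a reflection helper used here only for illustration. It is not needed in ordinary code, and how it prints may differ between Julia versions.

```julia
f(x) = x + 1

println(typeof(f(3)))       # Int64 on a 64-bit machine
println(typeof(f(3.1)))     # Float64
println(f(1//2))            # 3//2

# Ask the compiler which return type it infers for a given argument type:
println(Base.return_types(f, (Int,)))
println(Base.return_types(f, (Float64,)))
```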

We can even see what the compiled code for `f(x::Int)` looks like, either the [compiler (LLVM) bytecode](https://en.wikipedia.org/wiki/LLVM) or the low-level (below C!) [assembly code](https://en.wikipedia.org/wiki/Assembly_language):

In [27]:
@code_llvm f(1)


define i64 @julia_f_72157(i64) #0 {
top:
  %1 = add i64 %0, 1
  ret i64 %1
}


In [28]:
@code_native f(1)

	.section	__TEXT,__text,regular,pure_instructions
Filename: In[23]
	pushq	%rbp
	movq	%rsp, %rbp
Source line: 1
	leaq	1(%rdi), %rax
	popq	%rbp
	retq
	nopw	(%rax,%rax)


Let's break this down.  When you tell Julia's compiler that `x` is an `Int`, it:

* Knows that `x` fits into a 64-bit CPU register (and is passed to the function via a register).

* Looks at `x + 1`.  Since `x` and `1` are both `Int`, it knows it should call the `+` function for two `Int` values.  This corresponds to *one machine instruction*, `leaq`, which here adds the constant 1 to a 64-bit register.

* Doesn't bother to do a function call, since the `+` function here is so simple.  Instead, it will [inline](https://en.wikipedia.org/wiki/Inline_expansion) the `(+)(Int,Int)` function into the compiled `f(x)` code.

* Knows that the *result* of the `+` is *also* an `Int`, since it now knows what `+` function it is calling, and it can return it via register.

This process works recursively if we define a new function `g(x)` that calls `f(x)`:

In [29]:
g(x) = f(x) + 3
g(1)

5

In [30]:
@code_llvm g(1)


define i64 @julia_g_72261(i64) #0 {
top:
  %1 = add i64 %0, 4
  ret i64 %1
}


When the compiler specialized `g` for `x::Int`, it not only figured out which `f` function to call and *inlined* `f(x)` *into `g`*, but was also smart enough to *combine the two additions* into a single addition `x + 4`.

## Defeating type inference: Type instabilities

To get good performance, there are some fairly simple rules that you need to follow in Julia code to avoid defeating the compiler's type inference.   See also the [performance tips section of the Julia manual](http://docs.julialang.org/en/stable/manual/performance-tips/).

Three of the most important are:

* Don't use (non-constant) global variables in critical code — put your critical code into a function (this is good advice anyway, from a software-engineering standpoint).  The compiler assumes that a **global variable can change type at any time**, so it is always stored in a "box", and "taints" anything that depends on it.

* Local variables should be "type-stable": **don't change the type of a variable inside a function**.  Use a new variable instead.

* Functions should be "type-stable": **a function's return type should only depend on the argument types, not on the argument values**.

To diagnose all of these problems, the `@code_warntype` macro that we used above is your friend.  If it labels any variables (or the function's return value) as `Any` or `Union{...}`, it means that the compiler couldn't figure out a precise type.
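
As a minimal sketch of the third rule, compare the two hypothetical functions below (`pos_unstable` and `pos_stable` are names invented for this illustration). Both clamp negative inputs to zero, but only the second returns a type that depends on the argument type alone.

```julia
# Type-unstable: returns the Int 0 for negative x, but a value of typeof(x) otherwise,
# so the return type depends on the *value* of x.
pos_unstable(x) = x < 0 ? 0 : x

# Type-stable: zero(x) has the same type as x, so the return type
# depends only on typeof(x).
pos_stable(x) = x < 0 ? zero(x) : x

println(typeof(pos_unstable(-1.5)))   # Int64 on a 64-bit machine
println(typeof(pos_unstable(1.5)))    # Float64
println(typeof(pos_stable(-1.5)))     # Float64
println(typeof(pos_stable(1.5)))      # Float64
```

Running `@code_warntype pos_unstable(1.5)` should flag the return value as a `Union` of the two types, while `pos_stable` should not be flagged.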

The third point, type-stability of functions, leads to lots of important but subtle choices in library design.  For example, consider the (built-in) `sqrt(x)` function, which computes $\sqrt{x}$:

In [31]:
sqrt(2)

1.4142135623730951

You might think that `sqrt(-1)` should return $i$ (or `im`, in Julia syntax).  (Matlab's `sqrt` function does this.)  Instead, we get:

In [32]:
sqrt(-1)

LoadError: DomainError:
sqrt will only return a complex result if called with a complex argument. Try sqrt(complex(x)).

In [33]:
sqrt(-1 + 0im)

0.0 + 1.0im

Why did Julia implement `sqrt` in this silly way, throwing an error for negative arguments unless you add a zero imaginary part?  Any reasonable person wants an imaginary result from `sqrt(-1)`, surely?

The problem is that defining `sqrt` to return an imaginary result from `sqrt(-1)` would **not be type stable**: `sqrt(x)` would return a real result for non-negative real `x`, and a complex result for negative real `x`, so the **return type would depend on the value of `x`** and **not just its type.**

That would defeat type inference, not just for the `sqrt` function, but for **anything the sqrt function touches**.  Unless the compiler can somehow figure out `x ≥ 0`, it will have to either store the result in a "box" or compile two branches of the result.  Let's see how that works by defining our own square-root function:

In [34]:
mysqrt(x::Complex) = sqrt(x)
mysqrt(x::Real) = x < 0 ? sqrt(complex(x)) : sqrt(x)

mysqrt (generic function with 2 methods)

This definition is an example of Julia's [multiple dispatch style](http://docs.julialang.org/en/stable/manual/methods/), which in some sense is a generalization of object-oriented programming but focuses on "verbs" (functions) rather than nouns.  We will discuss this more in a later lecture.

The `::Complex` and `::Real` are argument-type declarations.  Such declarations are **not related to performance**, but instead **act as a "filter"** to allow us to have one version of `mysqrt` for complex arguments and another for real arguments.
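
The "filter" role of argument-type declarations can be seen in a small stand-alone sketch. The function `describe` is hypothetical and exists only for this illustration. Julia picks the most specific method whose declared type matches the argument.

```julia
# Three methods of one function, distinguished only by argument type.
describe(x::Integer) = "an integer"
describe(x::Real)    = "a real number"
describe(x)          = "something else"

println(describe(1))         # an integer
println(describe(1.5))       # a real number
println(describe("hi"))      # something else
println(describe(1 + 2im))   # something else (Complex is not a subtype of Real)
```

Note that `1` is both an `Integer` and a `Real`; the `Integer` method wins because it is the more specific match.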

In [35]:
mysqrt(2)

1.4142135623730951

In [36]:
mysqrt(-2)

0.0 + 1.4142135623730951im

In [37]:
mysqrt(-2+0im)

0.0 + 1.4142135623730951im

Looks great, right?  But let's see what happens to type inference in a function that calls `mysqrt` instead of `sqrt`:

In [59]:
slowfun(x) = mysqrt(x) + 1
@code_warntype slowfun(2)

Variables:
  #self#::#slowfun
  x::Int64
  #temp#@_3::Union{Complex{Float64},Float64}
  #temp#@_4::LambdaInfo
  #temp#@_5::Union{Complex{Float64},Float64}

Body:
  begin 
      # meta: location In[34] mysqrt 2
      unless (Base.slt_int)(x::Int64,0)::Bool goto 5
      #temp#@_3::Union{Complex{Float64},Float64} = $(Expr(:invoke, LambdaInfo for sqrt(::Complex{Float64}), :(Base.sqrt), :($(Expr(:new, Complex{Float64}, :((Base.box)(Float64,(Base.sitofp)(Float64,x))), :((Base.box)(Float64,(Base.sitofp)(Float64,0))))))))
      goto 7
      5: 
      #temp#@_3::Union{Complex{Float64},Float64} = (Base.Math.box)(Base.Math.Float64,(Base.Math.sqrt_llvm)((Base.box)(Float64,(Base.sitofp)(Float64,x::Int64))))::Float64
      7: 
      # meta: pop location
      unless (Core.isa)(#temp#@_3::Union{Complex{Float64},Float64},Float64)::Any goto 12
      #temp#@_4::LambdaInfo = LambdaInfo for +(::Float64, ::Int64)
      goto 21
  

Because the compiler **doesn't know at compile-time that `x` is non-negative** (at compile-time it **uses only types, not values**), it doesn't know whether the result is real (`Float64`) or complex (`Complex{Float64}`) and has to store it in a "box".  This kills performance.
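
The value dependence can also be observed directly at runtime. The sketch below repeats the `mysqrt` definitions so that it runs on its own, and contrasts them with one possible type-stable alternative: a hypothetical `complexsqrt` that always converts its argument to complex first, as the error message from `sqrt(-1)` suggests. This is only one design option, shown for comparison.

```julia
mysqrt(x::Complex) = sqrt(x)
mysqrt(x::Real) = x < 0 ? sqrt(complex(x)) : sqrt(x)

# The result type of mysqrt depends on the sign of a real argument:
println(mysqrt(4.0) isa Float64)                  # true
println(mysqrt(-4.0) isa Complex{Float64})        # true

# Hypothetical type-stable alternative: always return a complex result.
complexsqrt(x::Real) = sqrt(complex(x))

println(complexsqrt(4.0) isa Complex{Float64})    # true
println(complexsqrt(-4.0) isa Complex{Float64})   # true
println(complexsqrt(-4.0) == 2.0im)               # true
```

The built-in `sqrt` takes the other type-stable route: for a real argument it always returns a real result, and throws an error when that is impossible.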

## Defining our own types

Let's define our own type to represent a **"point" in two dimensions**.  Each point will have an $(x,y)$ location.  So that we can use the points with our `sum` functions above, we'll also define `+` and `zero` functions to do the obvious **vector addition**.

The simplest such definition in Julia is:

In [39]:
type Point1
    x
    y
end
Base.:+(p::Point1, q::Point1) = Point1(p.x + q.x, p.y + q.y)
Base.zero(::Type{Point1}) = Point1(0,0)

Point1(3,4)

Point1(3,4)

In [40]:
Point1(3,4) + Point1(5,6)

Point1(8,10)

Our type is very generic, and can hold any type of `x` and `y` values:

In [41]:
Point1(3.7, 4+5im)

Point1(3.7,4 + 5im)

Perhaps too generic:

In [42]:
Point1("x", [3,4,5])

Point1("x",[3,4,5])

Since `x` and `y` can be *anything*, they must be **pointers to "boxes"**.  This is **bad news for performance**.

A `type` is *mutable*, which means we can create a `Point1` object and then change `x` or `y`:

In [43]:
p = Point1(3,4)
p.x = 7
p

Point1(7,4)

This means that every reference to a `Point1` object must be a *pointer* to an object stored elsewhere in memory, because *how else would we "know" when an object changes?*  Furthermore, an **array of `Point1` objects must be an array of pointers** (which is **bad news for performance** again):

In [44]:
P = [p,p,p]

3-element Array{Point1,1}:
 Point1(7,4)
 Point1(7,4)
 Point1(7,4)

In [45]:
p.y = 8
P

3-element Array{Point1,1}:
 Point1(7,8)
 Point1(7,8)
 Point1(7,8)

Let's test this out by creating an array of `Point1` objects and summing it.  Ideally, this would be about twice as slow as summing an equal-length array of numbers, since there are twice as many numbers to sum.  But because of all of the boxes and pointer-chasing, it should be far slower.

To create the array, we'll call the `Point1(x,y)` constructor with our array `a`, using Julia's ["dot-call" syntax](http://docs.julialang.org/en/stable/manual/functions/#dot-syntax-for-vectorizing-functions) that applies a function "element-wise" to arrays:

In [46]:
a1 = Point1.(a, a)

10000000-element Array{Point1,1}:
 Point1(0.0854553,0.0854553)
 Point1(0.885596,0.885596)  
 Point1(0.204148,0.204148)  
 Point1(0.291398,0.291398)  
 Point1(0.491761,0.491761)  
 Point1(0.385552,0.385552)  
 Point1(0.858012,0.858012)  
 Point1(0.610907,0.610907)  
 Point1(0.202762,0.202762)  
 Point1(0.772584,0.772584)  
 Point1(0.0807838,0.0807838)
 Point1(0.942604,0.942604)  
 Point1(0.525602,0.525602)  
 ⋮                          
 Point1(0.450704,0.450704)  
 Point1(0.6765,0.6765)      
 Point1(0.0218983,0.0218983)
 Point1(0.60217,0.60217)    
 Point1(0.601585,0.601585)  
 Point1(0.906846,0.906846)  
 Point1(0.0805392,0.0805392)
 Point1(0.346368,0.346368)  
 Point1(0.764284,0.764284)  
 Point1(0.420253,0.420253)  
 Point1(0.596481,0.596481)  
 Point1(0.485411,0.485411)  

In [47]:
@benchmark sum($a1)

BenchmarkTools.Trial: 
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  610.35 mb
  allocs estimate:  29999997
  minimum time:     445.14 ms (5.68% GC)
  median time:      484.73 ms (5.89% GC)
  mean time:        479.87 ms (6.14% GC)
  maximum time:     505.38 ms (6.74% GC)

In [60]:
@benchmark mysum3($a1)

BenchmarkTools.Trial: 
  samples:          11
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  610.35 mb
  allocs estimate:  30000001
  minimum time:     471.91 ms (6.24% GC)
  median time:      488.49 ms (6.34% GC)
  mean time:        486.19 ms (6.39% GC)
  maximum time:     502.19 ms (6.23% GC)

The time is about **450–500ms**, at least 50× slower than we would like, but consistent with our other timing results on "boxed" values from above.

By the way, we can also use our `relerr` function to compare the two summation algorithms (which work without modification on our new type, hooray!), but we'll need to define `-` and `abs` first:

In [49]:
Base.:-(p::Point1, q::Point1) = Point1(p.x - q.x, p.y - q.y)
Base.abs(p::Point1) = hypot(p.x, p.y) # sqrt(p.x^2 + p.y^2)
relerr(sum(a1),mysum3(a1))

1.7358575333763837e-13

### An imperfect solution: A concrete immutable type

We can avoid these two problems by:

* Declaring the types of `x` and `y` to be *concrete* types, so that they don't need to be pointers to boxes.
* Declaring our Point to be an `immutable` type (`x` and `y` cannot change), so that Julia is not forced to make every reference to a Point into a pointer.

In [50]:
immutable Point2
    x::Float64
    y::Float64
end
Base.:+(p::Point2, q::Point2) = Point2(p.x + q.x, p.y + q.y)
Base.zero(::Type{Point2}) = Point2(0.0,0.0)

Point2(3,4)

Point2(3.0,4.0)

In [51]:
Point2(3,4) + Point2(5,6)

Point2(8.0,10.0)

In [52]:
p = Point2(3,4)
P = [p,p,p]

3-element Array{Point2,1}:
 Point2(3.0,4.0)
 Point2(3.0,4.0)
 Point2(3.0,4.0)

In [53]:
p.x = 6 # gives an error since p is immutable

LoadError: type Point2 is immutable

If this is working as we hope, then summation should be much faster:

In [54]:
a2 = Point2.(a,a)
@benchmark sum($a2)

BenchmarkTools.Trial: 
  samples:          352
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     10.15 ms (0.00% GC)
  median time:      13.24 ms (0.00% GC)
  mean time:        14.21 ms (0.00% GC)
  maximum time:     23.04 ms (0.00% GC)

Now the time is **only 10ms**, slightly more than twice the cost of summing an array of individual numbers of the same length!

Unfortunately, we paid a big price for this performance: our `Point2` type only works with *a single numeric type* (`Float64`), much like a C implementation.
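
To make that price concrete, here is a minimal sketch using a hypothetical stand-in type `FixedPoint` (not part of this notebook) with the same `Float64` fields as `Point2`. It assumes Julia 0.6 or later, where the `immutable` keyword used in this notebook is spelled `struct`. Whatever numeric type goes in, the default constructor converts it to `Float64`:

```julia
# Hypothetical stand-in for Point2; assumes Julia 0.6+ (`struct` instead of `immutable`).
struct FixedPoint
    x::Float64
    y::Float64
end

# Exact rational inputs are converted to Float64 by the default constructor.
p = FixedPoint(1//3, 2//3)

# An integer too large to be represented exactly in Float64 is rounded.
n_large = 2^53 + 1
q = FixedPoint(n_large, 0)
```

Here `p.x` is the `Float64` approximation of 1/3 rather than the exact rational `1//3`, and `q.x` is `2.0^53`, so the trailing `+ 1` is lost.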

### The best of both worlds: Parameterized immutable types

How do we get a `Point` type that works for *any* type of `x` and `y`, but at the same time allows us to have an array of points that is concrete and homogeneous (every point in the array is forced to be the same type)?  At first glance, this seems like a contradiction in terms.

The answer is not to define a *single* type, but rather to **define a whole family of types** that are *parameterized* by the type `T` of `x` and `y`.  In computer science, this is known as [parametric polymorphism](https://en.wikipedia.org/wiki/Parametric_polymorphism).  (An example of this can be found in [C++ templates](https://en.wikipedia.org/wiki/Template_%28C%2B%2B%29).)

In Julia, we will define such a family of types as follows:

In [55]:
immutable Point3{T<:Real}
    x::T
    y::T
end
Base.:+(p::Point3, q::Point3) = Point3(p.x + q.x, p.y + q.y)
Base.zero{T}(::Type{Point3{T}}) = Point3(zero(T),zero(T))

Point3(3,4)

Point3{Int64}(3,4)

Here, `Point3` is actually a family of subtypes `Point3{T}` for different types `T`.   The notation `<:` in Julia means "is a subtype of", and hence `T<:Real` means that we are constraining `T` to be a `Real` type (a built-in *abstract type* in Julia that includes e.g. integers and floating-point numbers).
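
The `<:` operator can also be tried directly at the prompt. The following sketch uses only built-in types (including the built-in parameterized family `Complex{T}`, which is not otherwise discussed in this section) to show which types would satisfy a `T<:Real` constraint:

```julia
# Built-in types only; `<:` returns true when the left type is a subtype of the right.
int_is_real     = Int <: Real               # integers are Real
float_is_real   = Float64 <: Real           # floating-point numbers are Real
string_is_real  = String <: Real            # strings are not numbers at all
complex_is_real = Complex{Float64} <: Real  # complex numbers are not Real

# Complex is itself a parameterized family, like Point3:
# each Complex{T} is a distinct concrete type belonging to the family Complex.
in_family = Complex{Float64} <: Complex
z_type    = typeof(1 + 2im)                 # the parameter T is inferred from the arguments
```

The first two results are `true` and the next two are `false`, so under this constraint a point could hold integers or floating-point values but not strings or complex numbers.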

In [56]:
Point3(3,4) + Point3(5.6, 7.8)

Point3{Float64}(8.6,11.8)

Now, let's make an array:

In [57]:
a3 = Point3.(a,a)

10000000-element Array{Point3{Float64},1}:
 Point3{Float64}(0.0854553,0.0854553)
 Point3{Float64}(0.885596,0.885596)  
 Point3{Float64}(0.204148,0.204148)  
 Point3{Float64}(0.291398,0.291398)  
 Point3{Float64}(0.491761,0.491761)  
 Point3{Float64}(0.385552,0.385552)  
 Point3{Float64}(0.858012,0.858012)  
 Point3{Float64}(0.610907,0.610907)  
 Point3{Float64}(0.202762,0.202762)  
 Point3{Float64}(0.772584,0.772584)  
 Point3{Float64}(0.0807838,0.0807838)
 Point3{Float64}(0.942604,0.942604)  
 Point3{Float64}(0.525602,0.525602)  
 ⋮                                   
 Point3{Float64}(0.450704,0.450704)  
 Point3{Float64}(0.6765,0.6765)      
 Point3{Float64}(0.0218983,0.0218983)
 Point3{Float64}(0.60217,0.60217)    
 Point3{Float64}(0.601585,0.601585)  
 Point3{Float64}(0.906846,0.906846)  
 Point3{Float64}(0.0805392,0.0805392)
 Point3{Float64}(0.346368,0.346368)  
 Point3{Float64}(0.764284,0.764284)  
 Point3{Float64}(0.420253,0.420253)  
 Point3{Float64}(0.596481,0.596481)  
 Point3

Note that the type of this array is `Array{Point3{Float64},1}` (we could equivalently write this as `Vector{Point3{Float64}}`, since `Vector{T}` is a synonym for `Array{T,1}`).  You should learn a few things from this:

* An `Array{T,N}` in Julia is itself a parameterized type, parameterized by the element type `T` and the dimensionality `N`.

* Since the element type `T` is encoded in the `Array{T,N}` type, the element type does not need to be stored in each element.  That means that the `Array` is free to store an array of "inlined" elements, rather than an array of pointers to boxes.  (This is why `Array{Float64,1}` earlier could be stored in memory like a C `double*`.)

* It is still important that the element type be `immutable`, since an array of mutable elements would still need to be an array of pointers (so that it could "notice" if another reference to an element mutates it).
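
One way to check these storage claims is sketched below. It assumes Julia 1.x (where `immutable` is spelled `struct` and the `isbitstype` function is available) and uses hypothetical types `InlinePoint` and `BoxedPoint` that are not part of this notebook:

```julia
# Hypothetical types for illustration; assumes Julia 1.x syntax.
struct InlinePoint{T<:Real}      # immutable, like Point3
    x::T
    y::T
end

mutable struct BoxedPoint        # mutable counterpart, for contrast
    x::Float64
    y::Float64
end

pts = [InlinePoint(1.0, 2.0), InlinePoint(3.0, 4.0)]

# Vector{T} is just another name for Array{T,1}.
same_type = Vector{InlinePoint{Float64}} === Array{InlinePoint{Float64},1}

# A concrete immutable type made of plain numbers is a "bits type",
# so its values can be stored inline; a mutable type is not.
inline_ok  = isbitstype(InlinePoint{Float64})
mutable_ok = isbitstype(BoxedPoint)

# The array data is 2 points, each with 2 fields of 8 bytes, and no per-element type tags.
nbytes = sizeof(pts)

# Viewing the same memory as Float64 values gives x1, y1, x2, y2 in order.
flat = collect(reinterpret(Float64, pts))
```

With these definitions, `inline_ok` is `true`, `mutable_ok` is `false`, and `nbytes` is 32, which is exactly the size of four `Float64` values laid out back to back.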

In [58]:
@benchmark sum($a3)

BenchmarkTools.Trial: 
  samples:          407
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     10.10 ms (0.00% GC)
  median time:      11.71 ms (0.00% GC)
  mean time:        12.28 ms (0.00% GC)
  maximum time:     23.28 ms (0.00% GC)

Hooray! It is again **only 10ms**, the same time as our completely concrete and inflexible `Point2`.
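
To see the flexibility half of this result, here is a minimal sketch with a hypothetical `GenericPoint` type (a stand-in for `Point3`, not part of this notebook). It assumes Julia 0.6 or later, so it uses `struct` and `where` rather than the `immutable` syntax above. The same definitions, together with the built-in `sum`, work for integer and rational coordinates:

```julia
# Hypothetical stand-in for Point3; assumes Julia 0.6+ syntax.
struct GenericPoint{T<:Real}
    x::T
    y::T
end
Base.:+(p::GenericPoint, q::GenericPoint) = GenericPoint(p.x + q.x, p.y + q.y)
Base.zero(::Type{GenericPoint{T}}) where {T} = GenericPoint(zero(T), zero(T))

# Integer coordinates: the array's element type is GenericPoint{Int}.
int_pts   = GenericPoint.(1:4, 5:8)
int_total = sum(int_pts)

# Exact rational coordinates: the element type is GenericPoint{Rational{Int}}.
rat_pts   = GenericPoint.([1//2, 1//3], [1//4, 1//6])
rat_total = sum(rat_pts)
```

Each array is still homogeneous and concrete, so the element type is recorded once in the array type rather than in every element.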