# Automatic Differentiation
Automatic differentiation (AD) is an extremely powerful tool. In Julia, you can differentiate almost any valid Julia program to obtain derivatives, gradients, Jacobians, and Hessians automatically and with high performance. AD makes up the bulk of most deep-learning libraries, but unlike most of those libraries, Julia does not require you to write your code in a subset of the language or in a DL-specific language: you can just write regular Julia code.

There are a number of different kinds of AD. In the following, we will refer to a function 
$$ f : \mathbb{R}^n \to \mathbb{R}^m $$

## Forward-mode AD
Using dual numbers, forward-mode AD (FAD) performs a single forward pass of your program, calculating both the function value and its derivatives in one go. FAD is algorithmically favorable when $f$ is "few to many", or $n < m$. It also typically has the least overhead, so it is competitive when both $n$ and $m$ are small.
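
To make the dual-number idea concrete, here is a minimal sketch for a scalar function of one variable. The type `MyDual` and the helpers `derivative` and `h` are hypothetical names introduced only for illustration, and only `+`, `*` and `sin` are overloaded; this is not how ForwardDiff is implemented internally.

```julia
# Hypothetical minimal dual number: a value paired with its derivative.
struct MyDual
    val::Float64  # function value
    der::Float64  # derivative with respect to the seeded input
end

# Each overloaded operation propagates the derivative with the usual calculus rules.
Base.:+(a::MyDual, b::MyDual) = MyDual(a.val + b.val, a.der + b.der)
Base.:*(a::MyDual, b::MyDual) = MyDual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.sin(a::MyDual) = MyDual(sin(a.val), cos(a.val) * a.der)

# Seed the derivative part with 1 and read the derivative off the result.
derivative(h, x) = h(MyDual(x, 1.0)).der

h(x) = x * sin(x) + x
derivative(h, 2.0)  # analytically: sin(2) + 2cos(2) + 1
```

A single evaluation of `h` yields both the value and the derivative with respect to one seeded input. With $n$ inputs, one seed per input direction is needed, which is why the cost of forward mode grows with $n$.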

## Reverse-mode AD
This is what is used in DL libraries. Reverse-mode AD (RAD) works by constructing a computation graph, either before execution (as in older versions of TensorFlow) or during execution (the most common approach today).
RAD is algorithmically favorable when $f$ is "many to few", or $n > m$. This is the case in most DL, where the cost function is a scalar-valued function of very many parameters, the NN weights. For functions with many outputs, forward mode is typically the better choice.
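
The graph-building idea can be sketched with a small tape that is recorded during execution and then traversed backwards. Everything here (`Node`, `TAPE`, `record`, `leaf`, `toy_gradient`, `k`) is a hypothetical name for illustration, and only `+`, `*` and `sin` are supported; it is a sketch under these assumptions, not how Tracker or Zygote are implemented.

```julia
# Hypothetical tape node: value, accumulated adjoint, and (parent, local partial) pairs.
mutable struct Node
    val::Float64
    grad::Float64
    parents::Vector{Tuple{Node,Float64}}
end

const TAPE = Node[]  # nodes in creation order, which is a valid topological order

function record(val, parents)
    n = Node(val, 0.0, parents)
    push!(TAPE, n)
    return n
end

leaf(v) = record(v, Tuple{Node,Float64}[])

# Overloaded operations record the result together with the local partial derivatives.
Base.:+(a::Node, b::Node) = record(a.val + b.val, [(a, 1.0), (b, 1.0)])
Base.:*(a::Node, b::Node) = record(a.val * b.val, [(a, b.val), (b, a.val)])
Base.sin(a::Node) = record(sin(a.val), [(a, cos(a.val))])

function toy_gradient(h, xs::Vector{Float64})
    empty!(TAPE)
    inputs = [leaf(x) for x in xs]
    out = h(inputs)                    # forward pass builds the graph
    out.grad = 1.0                     # seed the scalar output
    for n in Iterators.reverse(TAPE)   # backward pass in reverse creation order
        for (p, w) in n.parents
            p.grad += w * n.grad       # chain rule: accumulate adjoints
        end
    end
    return [n.grad for n in inputs]
end

k(x) = x[1] * x[2] + sin(x[1])
toy_gradient(k, [2.0, 3.0])  # analytically: [3 + cos(2), 2]
```

One backward sweep produces the derivative of the scalar output with respect to every input, which is why reverse mode suits functions with many inputs and few outputs.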

In [57]:
using ForwardDiff, BenchmarkTools

f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

x = rand(5) # small size 
g = x -> ForwardDiff.gradient(f, x); # g = ∇f
g(x)

5-element Array{Float64,1}:
 1.56704478753878  
 1.8334625791389418
 1.4096073759914765
 2.032177627585189 
 2.1173987352023795

In [58]:
@btime g($x);

  558.234 ns (3 allocations: 688 bytes)


In [59]:
ForwardDiff.hessian(f, x)

5×5 Array{Float64,2}:
 0.804648  2.07193  1.7873   2.41604   2.57671 
 2.07193   0.73617  2.11972  2.86537   3.05593 
 1.7873    2.11972  1.48638  2.47167   2.63599 
 2.41604   2.86537  2.47167  0.815578  3.56341 
 2.57671   3.05593  2.63599  3.56341   0.860096

In [60]:
using Zygote

gz = x -> Zygote.gradient(f, x)[1]; # g = ∇f
gz(x) ≈ g(x)

true

In [61]:
@btime gz($x);

  23.898 μs (321 allocations: 11.19 KiB)


In [62]:
using Flux.Tracker
using Flux.Tracker: data

gf = x -> data(Tracker.gradient(f, x)[1]) # g = ∇f
gf(x) ≈ g(x)

true

In [63]:
@btime gf($x);

  9.068 μs (143 allocations: 5.18 KiB)


If we change the size of the input vector, the relative timings change: with 5000 inputs, forward mode (ForwardDiff) becomes much slower than reverse mode (Tracker).

In [64]:
x = rand(5000)
@btime g($x);

  103.389 ms (5 allocations: 548.17 KiB)


In [65]:
#@btime gz($x);

In [66]:
@btime gf($x);

  362.723 μs (157 allocations: 551.40 KiB)


Most AD in Julia (except for Zygote and perhaps Yota) works by overloading `Base` functions on custom types. This means that you cannot use AD if you restrict the input types of your functions too much! In the following example, the input is restricted to `Vector{Float64}`:

In [72]:
x = abs.(randn(3))
f2(x::Vector{Float64}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
f2(x)

1.0202877839345177

In [73]:
ForwardDiff.gradient(f2, x);

This did not work, since ForwardDiff calls the function with an argument of type `Vector{<:Dual}`.

In [74]:
Tracker.gradient(f2, x);

MethodError: MethodError: no method matching f2(::TrackedArray{…,Array{Float64,1}})
Closest candidates are:
  f2(!Matched::Array{Float64,1}) at In[72]:2
  f2(!Matched::Array{T,1} where T) at In[47]:1

This didn't work either, since `Tracker` calls the function with an argument of type `TrackedArray`.

In [75]:
f3(x::Vector) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
ForwardDiff.gradient(f3, x)

3-element Array{Float64,1}:
 0.528434956628905 
 1.2242482859668935
 1.1822048076863592

In [76]:
Tracker.gradient(f3, x);

MethodError: MethodError: no method matching f3(::TrackedArray{…,Array{Float64,1}})
Closest candidates are:
  f3(!Matched::Array{T,1} where T) at In[75]:1
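
The last error arises because a `TrackedArray` is not a `Vector`, so even the relaxed annotation `x::Vector` is too strict for Tracker. A common remedy is to annotate with abstract types instead (or to leave the argument unannotated). The sketch below assumes ForwardDiff is installed; `f4` and `x4` are hypothetical names continuing the numbering above.

```julia
using ForwardDiff

# Abstract annotation: accepts Vector{Float64} as well as vectors of dual numbers,
# since ForwardDiff's Dual is a subtype of Real.
f4(x::AbstractVector{<:Real}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x)

x4 = abs.(randn(3))
f4(x4)                        # a plain Float64 vector still works
ForwardDiff.gradient(f4, x4)  # and so does a vector of dual numbers
```

Since the error messages above show Tracker calling the function with a `TrackedArray`, which is an array type rather than a `Vector`, the same abstract annotation should in principle let Tracker through as well, although that is not verified here.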