# Automatic Differentiation
Automatic differentiation (AD) is an extremely powerful tool. In Julia, you can differentiate almost any valid Julia program to obtain derivatives, gradients, Jacobians, Hessians, etc. automatically and with high performance. AD makes up the bulk of most deep-learning libraries, but contrary to most of them, you do not need to write your code in a subset of Julia or in a DL-specific language; you can just write regular Julia code.

There are a number of different kinds of AD. In the following, we will refer to a function
$$ f : \mathbb{R}^n \to \mathbb{R}^m $$

## Forward-mode AD
Using dual numbers, forward-mode AD (FAD) performs a single forward pass of your program, calculating both the function value and the derivatives in one go. FAD is algorithmically favorable when $f$ is "few to many", i.e., $n < m$. It also typically has the least overhead, so it is competitive when both $n$ and $m$ are small.
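
The dual-number idea can be illustrated with a minimal sketch. This is not how ForwardDiff is implemented; the type `ToyDual` and the function `h` are hypothetical names made up for this illustration, and only the three operations that `h` needs are overloaded.

```julia
# Toy dual number (hypothetical, for illustration only).
struct ToyDual
    val::Float64   # function value
    der::Float64   # derivative carried along with the value
end

# Each overloaded operation applies the matching differentiation rule.
Base.:+(a::ToyDual, b::ToyDual) = ToyDual(a.val + b.val, a.der + b.der)                  # sum rule
Base.:*(a::ToyDual, b::ToyDual) = ToyDual(a.val * b.val, a.der * b.val + a.val * b.der)  # product rule
Base.sin(a::ToyDual) = ToyDual(sin(a.val), cos(a.val) * a.der)                           # chain rule

# Ordinary Julia code that knows nothing about ToyDual.
h(x) = x * sin(x) + x

# Seed the derivative of the input with 1, then run the program once.
d = h(ToyDual(2.0, 1.0))
d.val   # h(2)
d.der   # h'(2), i.e. sin(2) + 2cos(2) + 1
```

A single evaluation of `h` produces both the value and the derivative, which is the "one go" referred to above.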

## Reverse-mode AD
This is what is used in DL libraries. Reverse-mode AD (RAD) works by constructing a computation graph, either before execution (as in old TensorFlow) or during execution (most common today).
RAD is algorithmically favorable when $f$ is "many to few", i.e., $n > m$. This is the case in most DL, where the cost function is a scalar-valued function of very many parameters, the NN weights. For functions with many outputs, RAD needs one reverse pass per output, so forward mode is usually the better choice.
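
The usual cost argument behind these two rules of thumb can be written in terms of the Jacobian of $f$. The symbols $J$, $v$, $w$ and the unit vectors $e_i$, $e_j$ are introduced here for illustration only.

$$ J(x) = \frac{\partial f}{\partial x}(x) \in \mathbb{R}^{m \times n}, \qquad \text{FAD: } v \mapsto J(x)\,v, \quad v \in \mathbb{R}^n, \qquad \text{RAD: } w \mapsto w^\top J(x), \quad w \in \mathbb{R}^m $$

A forward sweep seeded with $v = e_i$ yields column $i$ of $J$, so the full Jacobian costs on the order of $n$ sweeps (an implementation may carry several directions in the same sweep, at proportionally higher cost per sweep). A reverse sweep seeded with $w = e_j$ yields row $j$, so the full Jacobian costs on the order of $m$ sweeps. For a scalar cost function, $m = 1$, and one reverse sweep gives the whole gradient regardless of $n$.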

In [1]:
using ForwardDiff, BenchmarkTools

f(x) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);

x = rand(5) # small size 
g = x -> ForwardDiff.gradient(f, x); # g = ∇f
g(x)

5-element Array{Float64,1}:
 0.9311986082037566
 1.013853749301598 
 0.9231284141431625
 0.9426932669501211
 1.0602620613621536

In [2]:
@btime g($x);

  619.439 ns (3 allocations: 688 bytes)


In [3]:
ForwardDiff.hessian(f, x)

5×5 Array{Float64,2}:
 -0.39107     0.114972   0.0660604   0.0709956   0.197128 
  0.114972   -0.189085   0.111833    0.120184    0.333512 
  0.0660604   0.111833  -0.406827    0.0690565   0.191749 
  0.0709956   0.120184   0.0690565  -0.367545    0.206059 
  0.197128    0.333512   0.191749    0.206059   -0.0631403

In [4]:
using Zygote

f'(x) ≈ g(x)

true

In [5]:
@btime f'($x);

  16.045 μs (188 allocations: 5.53 KiB)


If we increase the size of the input vector, the relative timings change: reverse mode now comes out ahead.

In [6]:
x = rand(5000)
@btime g($x);

  137.402 ms (5 allocations: 548.17 KiB)


In [7]:
@btime f'($x);

  1.435 ms (45161 allocations: 1.49 MiB)


Most AD packages in Julia (Zygote and Yota are exceptions) work by overloading `Base` functions on custom types. This means that you cannot use AD if you restrict the input types of your functions too much! In the following example, the input is restricted to `Vector{Float64}`:

In [8]:
x = abs.(randn(3))
f2(x::Vector{Float64}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
f2(x)

1.2047561207063906

In [9]:
ForwardDiff.gradient(f2, x);

MethodError: MethodError: no method matching f2(::Array{ForwardDiff.Dual{ForwardDiff.Tag{typeof(f2),Float64},Float64,3},1})
Closest candidates are:
  f2(!Matched::Array{Float64,1}) at In[8]:2

This did not work, since ForwardDiff calls the function with an argument of type `Vector{<:Dual}`, for which `f2` has no method.

In [10]:
f2'(x)

3-element Array{Float64,1}:
 0.7053881781452638
 1.1332927723054662
 1.7900311946971623

This works since Zygote does not use dispatch on custom types.
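
The `f2'` shorthand is compact but hides the call that is being made. As a sketch, assuming a Zygote version that provides `Zygote.gradient` (the ones we are aware of do), the same gradient can be requested explicitly. The variable name `∇f2` is made up for this example, and `f2` is redefined so that the snippet is self-contained.

```julia
using Zygote

# Same strictly typed function as f2 in the text.
f2(x::Vector{Float64}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x)

x = abs.(randn(3))

# Zygote.gradient returns a tuple with one entry per argument of f2,
# so the gradient with respect to x is the first element.
∇f2 = Zygote.gradient(f2, x)[1]
```

Since `f2` is only ever called with a plain `Vector{Float64}` here, the strict type annotation is not a problem.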

In [11]:
f3(x::Vector) = sum(sin, x) + prod(tan, x) * sum(sqrt, x);
ForwardDiff.gradient(f3, x)

3-element Array{Float64,1}:
 0.7053881781452638
 1.1332927723054662
 1.7900311946971623
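
`Vector` without a type parameter accepts any element type. A somewhat tighter alternative, sketched here under the hypothetical name `f4`, is to restrict the elements to `Real`. ForwardDiff's dual numbers are subtypes of `Real`, so dispatch still succeeds.

```julia
using ForwardDiff

# Hypothetical variant of f2: any vector of real numbers is accepted,
# which includes vectors of ForwardDiff's dual numbers.
f4(x::AbstractVector{<:Real}) = sum(sin, x) + prod(tan, x) * sum(sqrt, x)

x = abs.(randn(3))
∇f4 = ForwardDiff.gradient(f4, x)
```

Using `AbstractVector` rather than `Vector` is a design choice, not a requirement: it additionally admits views and other array types.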