# [Model-Building Basics](https://github.com/FluxML/Flux.jl/blob/master/docs/src/models/basics.md#model-building-basics)

## Taking Gradients

* ~~_**It seems that Flux mainly uses [ForwardDiff](https://github.com/JuliaDiff/ForwardDiff.jl) to implement the backward computation of an operator**_.~~
  * ~~Will forward-mode AD be much slower than reverse-mode AD?~~
  
  **Flux implements its own reverse-mode AD**.

* The [`gradient` function](https://github.com/FluxML/Flux.jl/blob/master/src/tracker/back.jl#L180) takes another Julia function $f$ and a set of arguments, and returns the gradient with respect to each argument.

In [41]:
using Flux.Tracker

f(x) = 3x^2 + 2x + 1

df(x) = Tracker.gradient(f, x)[1]
d2f(x) = Tracker.gradient(df, x)[1]

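df(2)   # first derivative at x = 2 (not displayed: a cell shows only its last value)
d2f(2)  # second derivative at x = 2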

6.0 (tracked)
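
The displayed value can be checked by hand. Applying the power rule to the polynomial defined in the cell gives:

$$
f(x) = 3x^2 + 2x + 1, \qquad f'(x) = 6x + 2, \qquad f''(x) = 6
$$

so $f'(2) = 14$ and $f''(2) = 6$, which agrees with the `6.0 (tracked)` shown for `d2f(2)`.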

* Machine learning models can have hundreds of parameters.
* [`param`](https://github.com/FluxML/Flux.jl/blob/master/src/treelike.jl#L42) tells Flux to treat a value as a learnable parameter.
* `gradient` can then collect the gradients with respect to all of the parameters at once.

In [42]:
W = param(2)
b = param(3)

f(x) = W * x + b

params = Params([W, b])
grads = Tracker.gradient(() -> f(4), params)

@show grads[W]
@show grads[b]

grads[W] = 4.0
grads[b] = 1.0


1.0
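
The printed gradients follow directly from the definition of `f` in this cell, evaluated at $x = 4$ (the values $W = 2$ and $b = 3$ do not affect the result because `f` is linear in both):

$$
f(x) = W x + b, \qquad \frac{\partial f}{\partial W} = x = 4, \qquad \frac{\partial f}{\partial b} = 1
$$

The trailing `1.0` is simply the value of the cell's last expression, `@show grads[b]`.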

* Values created with `param` are *tracked*: they behave like normal numbers or arrays, but keep a record of every operation applied to them.
* That record is what allows Flux to calculate gradients with respect to tracked values.
* In this form, `gradient` takes a zero-argument function.
  * No arguments are necessary because the `Params` object tells it what to differentiate with respect to.

## Building Layers

In [43]:
function linear(in, out)
  W = param(randn(out, in))
  b = param(randn(out))
  x -> W * x .+ b
end

linear (generic function with 1 method)

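* In the above cell, `linear` returns the value of its last expression, the anonymous function `x -> W * x .+ b`. This function is a closure: it captures `W` and `b`, which is why they are accessed later as `linear1.W` and `linear1.b`.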
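
The closure pattern can be tried without Flux. The following is a minimal sketch in plain Julia; `affine` and `layer` are hypothetical names introduced here for illustration, and ordinary arrays stand in for tracked parameters. Reading captured variables as fields of the closure is a detail of how Julia currently represents closures, not a Flux feature.

```julia
# Hypothetical helper (not part of Flux): builds an affine map from plain arrays.
function affine(W, b)
    x -> W * x .+ b   # the last expression is the function's return value
end

layer = affine([1.0 2.0; 3.0 4.0], [0.5, -0.5])

layer.W              # the captured weight matrix
layer.b              # the captured bias vector
layer([1.0, 1.0])    # evaluates W * x .+ b for x = [1.0, 1.0]
```

Swapping the plain arrays for `param(...)` values gives the `linear` function defined in the cell above.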

In [44]:
# Building models by stacking multiple layers.

# Define 2 linear layers
linear1 = linear(5, 3)
linear2 = linear(3, 2)

# Stack the 2 linear layers
predict(x) = linear2(linear1(x))

# Random test data
x, y = rand(5), rand(2)

# Define the loss function (sum of squared errors)
loss(x, y) = sum((predict(x) .- y).^2)

# All the learnable parameters in the model
params = Params([linear1.W, linear1.b, linear2.W, linear2.b])
grads = Tracker.gradient(() -> loss(x, y), params)

@show grads[linear1.W]

grads[linear1.W] = Flux.Tracker.TrackedReal{Float64}[-5.72233 (tracked) -30.4402 (tracked) -9.67905 (tracked) -8.99108 (tracked) -0.504939 (tracked); 1.4963 (tracked) 7.95964 (tracked) 2.53092 (tracked) 2.35103 (tracked) 0.132034 (tracked); 0.337755 (tracked) 1.7967 (tracked) 0.571297 (tracked) 0.53069 (tracked) 0.0298036 (tracked)]


Tracked 3×5 Array{Float64,2}:
 -5.72233   -30.4402   -9.67905   -8.99108  -0.504939 
  1.4963      7.95964   2.53092    2.35103   0.132034 
  0.337755    1.7967    0.571297   0.53069   0.0298036
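
The shape and structure of this gradient can be checked with the chain rule. The symbols $h$ and $r$ below are notation introduced here, not names from the notebook; $W_1, b_1, W_2, b_2$ stand for the parameters of `linear1` and `linear2`:

$$
h = W_1 x + b_1, \qquad r = W_2 h + b_2 - y, \qquad \text{loss} = \sum_i r_i^2
$$

$$
\frac{\partial\,\text{loss}}{\partial W_1} = 2\,\left(W_2^\top r\right) x^\top
$$

Here $W_2^\top r$ has length 3 and $x$ has length 5, so the result is a 3×5 outer product. Every row is therefore a multiple of $x^\top$, which is consistent with the printed output: the first row is roughly $-3.82$ times the second in every column.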