# 4.3 Automatic Differentiation
In machine learning, getting better means minimizing a loss function.  
With neural networks we choose loss functions that are differentiable with respect to our parameters.  
The autograd package expedites this work by automatically calculating derivatives.

In [3]:
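import torch
# Variable has been deprecated since PyTorch 0.4; a plain tensor created with
# requires_grad=True behaves the same way. The import is kept for the cells below.
from torch.autograd import Variable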

As a toy example, we differentiate the mapping $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$.
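For reference, the gradient that autograd should recover for this mapping can be worked out by hand. Writing the product componentwise, with $x_i$ denoting the entries of $\mathbf{x}$ (notation introduced here for the derivation):

$$y = 2\mathbf{x}^{\top}\mathbf{x} = 2\sum_{i} x_i^2, \qquad \frac{\partial y}{\partial x_i} = 4x_i, \qquad \nabla_{\mathbf{x}}\, y = 4\mathbf{x}.$$

For $\mathbf{x} = (0, 1, 2, 3)^{\top}$ this gives $y = 28$ and a gradient of $(0, 4, 8, 12)^{\top}$.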

In [4]:
x = Variable(torch.arange(4, dtype=torch.float32).reshape((4, 1)), requires_grad=True)
print(x)

tensor([[0.],
        [1.],
        [2.],
        [3.]], requires_grad=True)


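Create the variable `x` and assign its initial value.  
Setting `requires_grad=True` tells autograd to record the operations applied to `x` so that gradients can be computed later.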

In [5]:
y = 2*torch.mm(x.t(), x)
print(y)

tensor([[28.]], grad_fn=<MulBackward0>)


(Recall that `x.t()` is $x^{\top}$.)  
`x` has shape (4, 1), and `y` holds a single value (a 1×1 tensor), so it can be treated as a scalar.  

In [6]:
y.backward()

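Calling `backward` computes the gradient of `y` with respect to `x` and stores it in `x.grad`. For $y = 2\mathbf{x}^{\top}\mathbf{x}$ the result should equal $4\mathbf{x}$, which the following cells confirm.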

In [8]:
print("x.grad: ", x.grad)
print("x.grad_fn: ", x.grad_fn)
print("y.grad_fn: ", y.grad_fn)

x.grad:  tensor([[ 0.],
        [ 4.],
        [ 8.],
        [12.]])
x.grad_fn:  None
y.grad_fn:  <MulBackward0 object at 0x000001CB36A9CDC0>


In [9]:
print((x.grad - 4*x).norm().item() == 0)
print(x.grad)

True
tensor([[ 0.],
        [ 4.],
        [ 8.],
        [12.]])
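
The same computation can be sketched without `Variable`, assuming PyTorch 0.4 or later, where plain tensors carry `requires_grad` themselves. This version also swaps the exact `== 0` test for `torch.allclose`, which is a more forgiving way to compare floating-point results; that substitution is a choice made here, not something taken from the cells above.

```python
import torch

# Plain tensor in place of Variable (assumes PyTorch >= 0.4)
x = torch.arange(4, dtype=torch.float32).reshape(4, 1).requires_grad_(True)

y = 2 * torch.mm(x.t(), x)  # shape (1, 1)
y.backward()                # no argument needed: y holds a single element

# x was created directly by the user, so it is a leaf tensor without a grad_fn;
# y was produced by an operation, so it carries one.
print(x.is_leaf, x.grad_fn is None, y.grad_fn is not None)

# Compare against the hand-derived gradient 4x, using a tolerance
print(torch.allclose(x.grad, 4 * x.detach()))
```

Both lines should print only `True` values if the gradient matches the hand-derived $4\mathbf{x}$.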


## Computing the Gradient of Python Control Flow

In [10]:
def func(a):
    b = a * 2
    while b.norm().item() < 1000:
        b = b * 2
    if b.sum().item() > 0:
        c = b
    else:
        c = 100 * b
    return c

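The function mixes a `while` loop and an `if` statement, so the sequence of operations depends on the value of `a`. Autograd records the operations that actually run, so the gradient can still be computed. Whichever branch is taken, the output is a constant multiple of `a`, so the gradient should equal the output divided by `a`.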

In [12]:
a = torch.randn(size=(1,))
a.requires_grad=True
d = func(a)
d.backward()

In [13]:
print(a.grad == (d / a))

tensor([True])
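
The check above works because, for a fixed input, `func` only ever multiplies `a` by constants, so the output has the form $k \cdot a$ and the gradient is $k$. The following is a minimal sketch that makes $k$ visible with fixed inputs instead of random ones; the inputs `1.5` and `-1.5` and the helper name `grad_of_func` are illustrative choices, not part of the original notebook.

```python
import torch

def func(a):
    b = a * 2
    while b.norm().item() < 1000:
        b = b * 2
    if b.sum().item() > 0:
        c = b
    else:
        c = 100 * b
    return c

def grad_of_func(value):
    """Hypothetical helper: return d func(a) / da evaluated at a = value."""
    a = torch.tensor([value], requires_grad=True)
    d = func(a)
    d.backward()
    # func is linear in a along the path actually taken, so a.grad == d / a
    assert torch.allclose(a.grad, (d / a).detach())
    return a.grad.item()

print(grad_of_func(1.5))   # 1024.0: one initial doubling plus nine in the loop
print(grad_of_func(-1.5))  # 102400.0: same loop, then the 100x branch
```

The two inputs have the same norm, so the loop runs the same number of times; only the sign decides which branch of the `if` contributes to the gradient.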


## Head gradients and the chain rule
Sometimes when we call the backward method, e.g. `y.backward()`, where
`y` is a function of `x` we are just interested in the derivative of
`y` with respect to `x`. Mathematicians write this as
$\frac{dy(x)}{dx}$. At other times, we may be interested in the
gradient of `z` with respect to `x`, where `z` is a function of `y`,
which, in turn, is a function of `x`. That is, we are interested in
$\frac{d}{dx} z(y(x))$. Recall that by the chain rule

$$\frac{d}{dx} z(y(x)) = \frac{dz(y)}{dy} \frac{dy(x)}{dx}.$$

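So, when `y` is part of a larger function `z` and we want `x.grad` to store $\frac{dz}{dx}$, we can pass in the *head gradient* $\frac{dz}{dy}$ as an input to `backward()`. In PyTorch, `backward()` can be called without an argument only when `y` holds a single element, in which case the head gradient is implicitly 1; for any other shape, a head gradient with the same shape as `y` must be passed explicitly. See [Wikipedia](https://en.wikipedia.org/wiki/Chain_rule) for more details.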

In [14]:
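x = Variable(torch.tensor([[0.],[1.],[2.],[3.]]), requires_grad=True)
y = x * 2
z = y * x  # elementwise z = 2 * x**2, so dz/dx = 4 * x

# z is not a scalar, so backward() needs a head gradient of the same shape
head_gradient = torch.tensor([[10.], [1.], [.1], [.01]])
z.backward(head_gradient)
print(x.grad)  # elementwise head_gradient * 4 * x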

tensor([[0.0000],
        [4.0000],
        [0.8000],
        [0.1200]])
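
One way to read the result above: passing a head gradient to `backward()` gives the same `x.grad` as differentiating the scalar `(head_gradient * z).sum()`. The sketch below checks this, compares against the hand-derived value `head_gradient * 4 * x`, and shows that calling `backward()` on a non-scalar tensor without a head gradient raises an error. The helper `make_z` is a hypothetical name introduced only to rebuild a fresh graph for each check.

```python
import torch

head_gradient = torch.tensor([[10.], [1.], [0.1], [0.01]])

def make_z():
    """Hypothetical helper: rebuild x and z = (2x) * x on a fresh graph."""
    x = torch.tensor([[0.], [1.], [2.], [3.]], requires_grad=True)
    z = (x * 2) * x
    return x, z

# 1) Pass the head gradient directly to backward()
x1, z1 = make_z()
z1.backward(head_gradient)

# 2) Differentiate the weighted scalar sum instead
x2, z2 = make_z()
(head_gradient * z2).sum().backward()
print(torch.allclose(x1.grad, x2.grad))

# 3) Both should match head_gradient * dz/dx = head_gradient * 4x
print(torch.allclose(x1.grad, head_gradient * 4 * x1.detach()))

# 4) Without a head gradient, a non-scalar output cannot be backpropagated
x3, z3 = make_z()
try:
    z3.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)
```

All three lines should print `True`.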
