# Neural Networks in PyTorch
## Chapter 2: Automatic Differentiation
Yen Lee Loh, 2021-9-1; 2022-9-22

---
## 0.  Setup

In [2]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
plt.rcParams.update ({'font.family':'serif', 'font.size':14})

---
## 1. Automatic differentiation
Let $x=2$ and $Y=\tanh x$.  Now suppose we want to calculate $dY/dx$.

In a **symbolic differentiation** approach, the computer would first derive the derivative analytically, $dY/dx={\rm sech}^2 x$, and then substitute the value of $x$ to obtain $\left.dY/dx\right|_{x=2}\approx 0.07$.

In an **automatic differentiation** approach, used in most implementations of artificial neural networks, the computer does not need to do any symbolic calculus.  Instead, every function "knows" its own derivative.  For example, when PyTorch calculates $Y=\tanh x$ for a certain value of $x$, it also records how to calculate $dY/dx = {\rm sech}^2 x$ for that value of $x$.  The actual implementation probably reuses the output it has already computed and evaluates $dY/dx = 1-Y^2$, which is equivalent because ${\rm sech}^2 x = 1-\tanh^2 x$.

In [6]:
x = torch.tensor(2.0, requires_grad=True)   # input  x = 2
Y = torch.tanh(x)                           # output Y = tanh(x)
Y.backward()      # calculate gradient of Y w.r.t. every tensor with requires_grad=True

In [7]:
print ("x         =", x)
print ("Y         =", Y)
print ("dY/dx     =", x.grad)  # the gradient of Y with respect to x

x         = tensor(2., requires_grad=True)
Y         = tensor(0.9640, grad_fn=<TanhBackward0>)
dY/dx     = tensor(0.0707)


In [8]:
print ("sech^2(x) =", 1/torch.cosh(x)**2)
print ("1-Y**2    =",1-Y**2)

sech^2(x) = tensor(0.0707, grad_fn=<MulBackward0>)
1-Y**2    = tensor(0.0707, grad_fn=<RsubBackward1>)


The example below does the same thing, except that $f$ is a neural network layer, and we feed in the input $x$ to get the output $Y$.

In [9]:
x = torch.tensor(2.0, requires_grad=True) # input           x = 2
f = nn.Tanh()                             # define a layer  f(x) = tanh(x)
Y = f(x)                                  # output          Y = tanh(2)
#f.zero_grad()
Y.backward()
print ("x         =", x)
print ("Y         =", Y)
print ("dY/dx     =", x.grad)

x         = tensor(2., requires_grad=True)
Y         = tensor(0.9640, grad_fn=<TanhBackward0>)
dY/dx     = tensor(0.0707)
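
To make the statement that every function "knows" its own derivative more concrete, here is a minimal sketch of a hand-written tanh that supplies its own backward rule through `torch.autograd.Function`.  The class name `MyTanh` is hypothetical and exists only for illustration; this is not how PyTorch's built-in `tanh` is defined internally, although it uses the same formula $dY/dx = 1-Y^2$.

```python
import torch

class MyTanh(torch.autograd.Function):
    """Hypothetical re-implementation of tanh, for illustration only."""

    @staticmethod
    def forward(ctx, x):
        Y = torch.tanh(x)
        ctx.save_for_backward(Y)          # remember Y for the backward pass
        return Y

    @staticmethod
    def backward(ctx, grad_output):
        (Y,) = ctx.saved_tensors
        # chain rule: incoming gradient times the local derivative 1 - Y^2
        return grad_output * (1 - Y**2)

x = torch.tensor(2.0, requires_grad=True)
Y = MyTanh.apply(x)                       # forward pass
Y.backward()                              # backward pass calls MyTanh.backward
print("Y         =", Y)
print("dY/dx     =", x.grad)
```

The gradient stored in `x.grad` should agree with the value obtained from `torch.tanh` and `nn.Tanh` above.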


---
## 2. Chain rule
Let $x=2$, $y=\tanh x$, and $z=\tanh y$.  Then
$\frac{dz}{dx} = \frac{dy}{dx} \frac{dz}{dy}$.



In [11]:
x = torch.tensor (2.0, requires_grad=True) # set       x = 2
y = torch.tanh (x) ; y.retain_grad()       # calculate y = tanh(x); keep y.grad (y is not a leaf)
z = torch.tanh (y)                         # calculate z = tanh(y)
z.backward()                               # calculate dz/dy = sech^2(y)
                                           # and       dy/dx = sech^2(x)
                                           # and thus  dz/dx
print ("x              =", x)
print ("y              =", y)
print ("z              =", z)
print ("dz/dy (auto)   =", y.grad)
print ("dz/dx (auto)   =", x.grad)
print ()
# print ("dy/dx (manual) = 1-y**2 =", 1-y**2)
# print ("dz/dy (manual) = 1-z**2 =", 1-z**2)
# print ("dz/dx (manual) = (1-y**2)*(1-z**2) =", (1-y**2)*(1-z**2))

x              = tensor(2., requires_grad=True)
y              = tensor(0.9640, grad_fn=<TanhBackward0>)
z              = tensor(0.7461, grad_fn=<TanhBackward0>)
dz/dy (auto)   = tensor(0.4434)
dz/dx (auto)   = tensor(0.0313)



The code below does essentially the same thing, for a network of the form

    x -------(tanh)-------> y --------(tanh)--------> z

In [12]:
x = torch.tensor(2.0, requires_grad=True)
f = nn.Sequential (nn.Tanh(), nn.Tanh())
z = f(x)
z.backward()
print ("x         =", x)
print ("z         =", z)
print ("dz/dx     =", x.grad)

x         = tensor(2., requires_grad=True)
z         = tensor(0.7461, grad_fn=<TanhBackward0>)
dz/dx     = tensor(0.0313)
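
As a cross-check on the chain rule, the sketch below multiplies the two local derivatives by hand and compares the product with the value from automatic differentiation.  The names `dy_dx`, `dz_dy`, and `dz_dx` are introduced here for illustration only.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tanh(x)
z = torch.tanh(y)
z.backward()                      # automatic: fills in x.grad = dz/dx

with torch.no_grad():             # manual arithmetic, no graph needed
    dy_dx = 1 - y**2              # sech^2(x), since y = tanh(x)
    dz_dy = 1 - z**2              # sech^2(y), since z = tanh(y)
    dz_dx = dy_dx * dz_dy         # chain rule

print("dy/dx (manual) =", dy_dx)
print("dz/dy (manual) =", dz_dy)
print("dz/dx (manual) =", dz_dx)
print("dz/dx (auto)   =", x.grad)
```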


---
## 3. Layers with learnable parameters
Let $x=2$ and $y=wx+b$, where $x$ is the input, $y$ is the output, and $w$ and $b$ are parameters.  For our present purposes, we may think of $w$ and $b$ as additional inputs:

<img src="SKETCHES/linearlayer2.png"/>

Then, automatic differentiation produces

$$\frac{\partial y}{\partial x} = w$$

$$\frac{\partial y}{\partial w} = x$$

$$\frac{\partial y}{\partial b} = 1.$$

In [13]:
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(4.0, requires_grad=True)
y = w * x + b
y.backward()
print ("x         = ", x)
print ("w         = ", w)
print ("b         = ", b)
print ("y         = ", y)
print ("dy/dx = w = ", x.grad)
print ("dy/dw = x = ", w.grad)
print ("dy/db = 1 = ", b.grad)

x         =  tensor(2., requires_grad=True)
w         =  tensor(3., requires_grad=True)
b         =  tensor(4., requires_grad=True)
y         =  tensor(10., grad_fn=<AddBackward0>)
dy/dx = w =  tensor(3.)
dy/dw = x =  tensor(2.)
dy/db = 1 =  tensor(1.)
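
Following the pattern of the earlier sections, the same calculation can be sketched with a neural network layer.  This assumes `nn.Linear(1, 1)` as the layer, with its randomly initialized weight and bias overwritten by hand so that $w=3$ and $b=4$; the input is given as a 1-element vector because `nn.Linear` acts on the last dimension of its input.

```python
import torch
import torch.nn as nn

f = nn.Linear(1, 1)                  # layer y = w x + b, one input, one output
with torch.no_grad():                # overwrite the random initial parameters
    f.weight.fill_(3.0)              # w = 3
    f.bias.fill_(4.0)                # b = 4

x = torch.tensor([2.0], requires_grad=True)
y = f(x)
y.sum().backward()                   # reduce the 1-element output to a scalar first

print("y         = ", y)
print("dy/dx = w = ", x.grad)
print("dy/dw = x = ", f.weight.grad)
print("dy/db = 1 = ", f.bias.grad)
```

Here the parameters live inside the layer, so their gradients are read from `f.weight.grad` and `f.bias.grad` rather than from separate tensors.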


Later, we will see that training a network involves adjusting parameters (such as $w$ and $b$) according to the derivatives of an objective function $\varepsilon$ with respect to those parameters (such as $\frac{\partial \varepsilon}{\partial w}$ and $\frac{\partial \varepsilon}{\partial b}$).  These derivatives are collectively referred to as the **gradient** of the objective function.  Computing this gradient using automatic differentiation with the chain rule is referred to as **backpropagation**.
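
As a preview, here is a minimal sketch of a single parameter adjustment.  Everything beyond $x$, $w$, and $b$ is an assumption made for illustration: the squared-error objective `eps`, the value of `target`, and the step size `rate` are hypothetical choices, not part of the text above.

```python
import torch

x = torch.tensor(2.0)                        # input (held fixed)
w = torch.tensor(3.0, requires_grad=True)    # parameter
b = torch.tensor(4.0, requires_grad=True)    # parameter
target = torch.tensor(0.0)                   # hypothetical desired output
rate = 0.01                                  # hypothetical step size

y = w * x + b
eps = (y - target)**2                        # assumed objective: squared error
eps.backward()                               # backpropagation: d(eps)/dw and d(eps)/db

with torch.no_grad():                        # adjust parameters against the gradient
    w -= rate * w.grad
    b -= rate * b.grad

eps_new = (w * x + b - target)**2            # objective after the adjustment
print("d(eps)/dw  =", w.grad)
print("d(eps)/db  =", b.grad)
print("eps before =", eps.item())
print("eps after  =", eps_new.item())
```

With these values the objective decreases after the step, which is the behavior that training relies on.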

---
## 4. Layers with multiple parameters
A general linear layer is of the form ${\bf Y} = {\bf W}\cdot {\bf X} + {\bf B}$.  That is,

$$Y_{i} = \sum_j W_{ij} X_{j} + B_i.$$

The partial derivatives are

$$\frac{\partial Y_i}{\partial X_j} = W_{ij} $$

$$\frac{\partial Y_k}{\partial W_{ij}} = \delta_{ki} X_{j} $$

$$\frac{\partial Y_i}{\partial B_j} = \delta_{ij}. $$
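
The partial derivatives of the linear layer can be checked numerically.  The sketch below assumes a layer with 2 inputs and 3 outputs filled with random values, and uses `torch.autograd.functional.jacobian` to obtain all partial derivatives at once; the function name `layer` and the array sizes are arbitrary choices for illustration.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
W = torch.randn(3, 2)        # weights W_ij
X = torch.randn(2)           # inputs  X_j
B = torch.randn(3)           # biases  B_i

def layer(W, X, B):
    return W @ X + B         # Y_i = sum_j W_ij X_j + B_i

# Each Jacobian has shape (output shape) + (input shape)
dY_dW, dY_dX, dY_dB = jacobian(layer, (W, X, B))

# Expected dY_k/dW_ij = delta_ki X_j, built explicitly for comparison
expected_dY_dW = torch.einsum('ki,j->kij', torch.eye(3), X)

print("dY/dX equals W:            ", torch.allclose(dY_dX, W))
print("dY/dW equals delta_ki X_j: ", torch.allclose(dY_dW, expected_dY_dW))
print("dY/dB equals the identity: ", torch.allclose(dY_dB, torch.eye(3)))
```

The derivative with respect to ${\bf B}$ comes out as the identity matrix: each output $Y_i$ depends only on its own bias $B_i$, with derivative 1.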


---
## 5. Example exercise
Consider a neural network implementing the function ${\bf Y} = \tanh( {\bf W}\cdot {\bf X} + {\bf B} )$.  Find the partial derivatives of the vector ${\bf Y}$ with respect to the parameters ${\bf W}$ and ${\bf B}$.