# Automatic differentiation


PyTorch records all computations to be able to backpropagate through them. That is, provided a sequence of operations that starts from a tensor $\theta$ to define a scalar $g(\theta)$, it is able to compute 
$\nabla_\theta g(\theta)$ exactly, with only one function call.

The **autograd** package allows automatic differentiation for all operations on Tensors.

*autograd.Variable* is the main class of the package. It has three attributes:
  - .data : the data (tensor) stored in the variable
  - .grad : the gradient w.r.t. this variable
  - .grad_fn : the *Function* that created the variable (None for variables created by the user)

Let's test this on a simple example

In [3]:
!pip3 install torch torchvision

Collecting torch
  Downloading torch-0.3.1-cp36-cp36m-manylinux1_x86_64.whl (496.4MB)
Collecting torchvision
  Downloading torchvision-0.2.0-py2.py3-none-any.whl (48kB)
Collecting pillow>=4.1.1 (from torchvision)
  Downloading Pillow-5.1.0-cp36-cp36m-manylinux1_x86_64.whl (2.0MB)
Installing collected packages: torch, pillow, torchvision
  Found existing installation: Pillow 4.0.0
    Uninstalling Pillow-4.0.0:
      Successfully uninstalled Pillow-4.0.0
Successfully installed pillow-5.1.0 torch-0.3.1 torchvision-0.2.0


In [0]:
import torch as th
from torch.autograd import Variable

## Define a function f

def f(x):
    return th.sqrt(th.sum(th.pow(x,2), dim=0))

In [0]:
# Define a point x on a GPU

dtype = th.cuda.FloatTensor
n = 10
x = th.randn(n).type(dtype)

# or directly : 
# x = th.randn(n).cuda()

### Define a variable 

x = Variable(x, requires_grad=True)

In [12]:
# Attributes of a variable

print(x.grad)
print(x.grad_fn)
print(x.data)

None
None

 0.4928
 1.9446
 1.1037
 1.8665
 0.0570
-0.0988
 0.6488
-0.6188
-1.7103
-0.3091
[torch.cuda.FloatTensor of size 10 (GPU 0)]



Now we can compute f(x):

In [33]:
out = f(x)
print(out)

Variable containing:
 3.5446
[torch.cuda.FloatTensor of size 1 (GPU 0)]



After calculating the output, we can backpropagate and get the gradient w.r.t. x:

In [0]:
out.backward(create_graph=True)  # create_graph=True keeps the graph, which is needed for higher-order derivatives

In [35]:
print(x.grad)
print(out.grad_fn)

Variable containing:
 0.1390
 0.5486
 0.3114
 0.5266
 0.0161
-0.0279
 0.1830
-0.1746
-0.4825
-0.0872
[torch.cuda.FloatTensor of size 10 (GPU 0)]

<SqrtBackward object at 0x7fc007c22ac8>
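
The gradient above can be checked against a closed form: since f is the Euclidean norm, its gradient is $x / \|x\|$. The following is a minimal sketch of that check, not part of the original notebook. It assumes a recent PyTorch release (where `Variable` still works as a thin wrapper) and runs on the CPU so that no GPU is required; the names `expected` and `max_err` are introduced here for illustration.

```python
import torch as th
from torch.autograd import Variable


def f(x):
    # Euclidean norm of x
    return th.sqrt(th.sum(th.pow(x, 2), dim=0))


# CPU tensor (assumption: chosen so the check runs without a GPU)
x = Variable(th.randn(10), requires_grad=True)

out = f(x)
out.backward()

# Closed form: the gradient of ||x|| is x / ||x||
expected = x.data / f(x).data

# Largest absolute difference between autograd and the closed form
max_err = (x.grad.data - expected).abs().max()
print(max_err)  # should be close to zero (floating-point error only)
```

If the two disagree by more than rounding error, the usual cause is a gradient left over from an earlier backward call.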


In [36]:
# We can reset the data in the grad to zero

x.grad.data.zero_()


 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
[torch.cuda.FloatTensor of size 10 (GPU 0)]
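
The reset matters because backward adds to the existing `.grad` instead of overwriting it. Here is a small sketch of that behaviour, assuming a recent PyTorch release and CPU tensors; the variable names `first`, `accumulated` and `after_reset` are illustrative only.

```python
import torch as th
from torch.autograd import Variable

x = Variable(th.ones(3), requires_grad=True)

# First backward pass: d/dx sum(2 * x) = 2 for every entry
(2 * x).sum().backward()
first = x.grad.data.clone()

# Second backward pass without a reset: the new gradient is added to the old one
(2 * x).sum().backward()
accumulated = x.grad.data.clone()

# Reset in place, then run a third pass
x.grad.data.zero_()
(2 * x).sum().backward()
after_reset = x.grad.data.clone()

print(first)        # every entry is 2
print(accumulated)  # every entry is 4
print(after_reset)  # every entry is 2
```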

We can also work with multivariable functions

In [0]:
## Define a function

def g(x,y,z):
  return (th.log(x**2 + y**2) + z*x*y).mean()


## Define variables: only x and y are treated as variables, z is a parameter

x = Variable(th.randn(n).cuda(), requires_grad=True)
y = Variable(th.randn(n).cuda(), requires_grad=True)
z = Variable(th.randn(n).cuda(), requires_grad=False)



In [60]:
out = g(x,y,z)
out

Variable containing:
 0.3290
[torch.cuda.FloatTensor of size 1 (GPU 0)]

In [0]:
out.backward()

We can now get the computed grads:

- x.grad contains $\nabla_x g(x, y, z)$
- y.grad contains $\nabla_y g(x, y, z)$

In [62]:
# it contains dg(x,y,z)/dx
x.grad

Variable containing:
 0.3210
 0.0265
 0.1718
-0.1191
-0.1462
 0.1195
-0.1124
-0.0202
-0.0497
-0.4000
[torch.cuda.FloatTensor of size 10 (GPU 0)]
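
As a sanity check, the partial derivatives of g can be written by hand: for each component, $\partial g / \partial x_i = \frac{1}{n}\left(\frac{2 x_i}{x_i^2 + y_i^2} + z_i y_i\right)$, and symmetrically for $y_i$. The sketch below compares these formulas with autograd. It assumes a recent PyTorch release and CPU tensors; the fixed seed and the names `expected_dx` and `expected_dy` are choices made for this example only.

```python
import torch as th
from torch.autograd import Variable


def g(x, y, z):
    return (th.log(x**2 + y**2) + z*x*y).mean()


th.manual_seed(0)  # fixed seed so the run is reproducible
n = 10
x = Variable(th.randn(n), requires_grad=True)
y = Variable(th.randn(n), requires_grad=True)
z = Variable(th.randn(n), requires_grad=False)

g(x, y, z).backward()

# Hand-derived partial derivatives; the 1/n factor comes from the mean
xd, yd, zd = x.data, y.data, z.data
expected_dx = (2 * xd / (xd**2 + yd**2) + zd * yd) / n
expected_dy = (2 * yd / (xd**2 + yd**2) + zd * xd) / n

print(th.allclose(x.grad.data, expected_dx, rtol=1e-4, atol=1e-6))
print(th.allclose(y.grad.data, expected_dy, rtol=1e-4, atol=1e-6))
print(z.grad)  # None: no gradient is stored for z because requires_grad=False
```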

## Non-scalar functions

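Let's consider the following example:

$$ y = xM $$

where $x$ is a row vector. The derivative of $y$ with respect to $x$ is $M^T$. Because $y$ is not a scalar, the backward method needs a *grad_tensors* argument (a tensor with the same shape as the output) that selects which combination of the output components is differentiated.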
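
To make this concrete, here is a short derivation, written under the assumption (taken from the code cell that follows) that $x$ is a row vector multiplied on the left of $M$, so that $y_j = \sum_i x_i M_{ij}$. The symbol $v$ for the tensor passed to backward is introduced here for illustration.

$$ \frac{\partial y_j}{\partial x_i} = M_{ij} $$

Calling backward with a tensor $v$ stores in x.grad the weighted sum of these partial derivatives:

$$ (\text{x.grad})_i = \sum_j v_j \frac{\partial y_j}{\partial x_i} = \sum_j v_j M_{ij} = (v M^T)_i $$

With $M = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$, choosing $v = (1, 0)$ gives $(1, 3)$, the first column of $M$, and $v = (0, 1)$ gives $(2, 4)$, the second column.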

In [74]:
x = Variable(th.FloatTensor([[2,1]]).cuda(), requires_grad=True)
M = Variable(th.FloatTensor([[1,2],[3,4]]).cuda()) 
y = th.mm(x,M)
print("y:",y)
print("M :",M)

y: Variable containing:
 5  8
[torch.cuda.FloatTensor of size 1x2 (GPU 0)]

M : Variable containing:
 1  2
 3  4
[torch.cuda.FloatTensor of size 2x2 (GPU 0)]



In [76]:
### Backprop with respect to the first element of y: dy1/dx (that is, dy1/dx1 and dy1/dx2)

y.backward(th.FloatTensor([[1, 0]]).cuda(),create_graph = True)
print(x.grad.data)
x.grad.data.zero_() #remove gradient in x.grad, or it will be accumulated

### Backprop with respect to the second element of y: dy2/dx (that is, dy2/dx1 and dy2/dx2)

y.backward(th.FloatTensor([[0, 1]]).cuda(),create_graph = True)
print(x.grad.data)
x.grad.data.zero_() 


 1  3
[torch.cuda.FloatTensor of size 1x2 (GPU 0)]


 2  4
[torch.cuda.FloatTensor of size 1x2 (GPU 0)]




 0  0
[torch.cuda.FloatTensor of size 1x2 (GPU 0)]
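
The two backward calls above can be generalised into a loop that builds the whole Jacobian one output component at a time. This is a minimal sketch, assuming a recent PyTorch release and CPU tensors; `rows` and `jacobian` are names introduced for this example, and `retain_graph=True` is used here in place of `create_graph=True` because only first derivatives are needed.

```python
import torch as th
from torch.autograd import Variable

x = Variable(th.FloatTensor([[2, 1]]), requires_grad=True)
M = Variable(th.FloatTensor([[1, 2], [3, 4]]))
y = th.mm(x, M)  # shape 1x2

rows = []
for j in range(y.size(1)):
    # One-hot tensor selecting the j-th component of y
    v = th.zeros(1, y.size(1))
    v[0, j] = 1
    # retain_graph=True keeps the graph alive for the next iteration
    y.backward(v, retain_graph=True)
    rows.append(x.grad.data.clone().view(-1))
    x.grad.data.zero_()  # reset, otherwise the rows would accumulate

jacobian = th.stack(rows)  # row j holds dy_j/dx
print(jacobian)  # rows are (1, 3) and (2, 4), i.e. the transpose of M
```

Each output component costs one backward pass, so this approach is only reasonable when the output is small.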

## Higher order derivatives

Consider the following function: $out = x^2 + xy + y^2$

Here x and y are matrices and all operations are elementwise. We can get the first and second derivatives easily, as shown in the following cells:

In [34]:
x = Variable(th.randn(2, 2), requires_grad=True)
y = Variable(th.randn(2, 2), requires_grad=True)

out = x ** 2 + x*y + y ** 2
out

Variable containing:
 1.1912  1.2992
 1.7304  1.6375
[torch.FloatTensor of size 2x2]

In [0]:
# do Backward

out.backward(th.ones(2, 2), create_graph=True)  # th.ones(2, 2) weights every entry of out by 1, so x.grad is the gradient w.r.t. x as a matrix

In [36]:
# x.grad contains the gradient of out w.r.t. x
x_grad = x.grad # = 2*x+y
x_grad

Variable containing:
-1.5591  1.8715
-0.4411 -1.7144
[torch.FloatTensor of size 2x2]

We can check that we got the right value

In [37]:
2*x + y

Variable containing:
-1.5591  1.8715
-0.4411 -1.7144
[torch.FloatTensor of size 2x2]

Using the same trick, we can get higher order derivatives. Don't forget to reset x.grad to zero first, otherwise the new derivative is added to the values already stored there.

In [38]:
x.grad.data.zero_() # If we don't use this, we will accumulate values

x_grad.backward(th.ones(2, 2))
x.grad

Variable containing:
 2  2
 2  2
[torch.FloatTensor of size 2x2]
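
An alternative that avoids the manual reset is `torch.autograd.grad`, which returns the gradients instead of accumulating them in `.grad`. The sketch below is not the approach used in the cells above. It assumes a recent PyTorch release and CPU tensors, and the names `dx` and `d2x` are introduced here for illustration.

```python
import torch as th
from torch.autograd import Variable, grad

x = Variable(th.randn(2, 2), requires_grad=True)
y = Variable(th.randn(2, 2), requires_grad=True)

out = x ** 2 + x * y + y ** 2

# First derivative w.r.t. x; create_graph=True makes the result differentiable
(dx,) = grad(out, x, grad_outputs=th.ones(2, 2), create_graph=True)

# Second derivative: differentiate dx = 2*x + y w.r.t. x
(d2x,) = grad(dx, x, grad_outputs=th.ones(2, 2))

print(dx)      # equals 2*x + y
print(d2x)     # every entry is 2
print(x.grad)  # None: grad() returns its results and leaves .grad untouched
```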