## PyTorch Tutorial
MILA, November 2017

### Torch Autograd, Variables, Define-by-run & Execution Paradigm

Adapted from
1. http://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#sphx-glr-beginner-blitz-autograd-tutorial-py 
2. http://pytorch.org/docs/master/notes/autograd.html

## Variables : Thin wrappers around tensors to facilitate autograd

A Variable supports almost all operations that can be performed on regular tensors.

In [4]:
from __future__ import print_function  # __future__ imports must come before any other import
import numpy as np

In [5]:
import torch 
from torch.autograd import Variable

![caption](images/Variable.png)

### Wrap tensors in a Variable

In [17]:
z = Variable(torch.Tensor(5, 3).uniform_(-1, 1))
print(z)

Variable containing:
 0.3175 -0.1359  0.4432
-0.7333  0.9455 -0.8064
 0.3552 -0.5301  0.0297
 0.5149  0.9356 -0.1797
 0.4917  0.3282 -0.4619
[torch.FloatTensor of size 5x3]



### Properties of Variables : Requiring gradients, Volatility, Data & Grad

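1. You can access the raw tensor through the `.data` attribute.
2. The gradient of the loss w.r.t. this Variable is accumulated into `.grad`.
3. `requires_grad` and `volatile` control whether gradients are tracked; both are covered in the section on excluding subgraphs from backward.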
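
To make "accumulated" concrete, here is a minimal sketch (the variable names are illustrative, not from the original notebook). It assumes a Variable created with `requires_grad=True` and shows that a second `backward()` call adds to `.grad` rather than overwriting it, which is why gradients are usually zeroed between iterations.

```python
import torch
from torch.autograd import Variable

# Hypothetical parameter; requires_grad=True asks autograd to track it.
w = Variable(torch.ones(3), requires_grad=True)

(2 * w).sum().backward()            # d/dw sum(2w) = 2 for every element
after_first = w.grad.data.clone()

(2 * w).sum().backward()            # same gradient again, added to .grad
after_second = w.grad.data.clone()

w.grad.data.zero_()                 # reset before the next iteration
after_reset = w.grad.data.clone()

print(after_first.tolist(), after_second.tolist(), after_reset.tolist())
```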

In [26]:
print('Requires Gradient : %s ' % (z.requires_grad))
print('Volatile : %s ' % (z.volatile))
print('Gradient : %s ' % (z.grad))
print(z.data)

Requires Gradient : False 
Volatile : False 
Gradient : None 

 0.3175 -0.1359  0.4432
-0.7333  0.9455 -0.8064
 0.3552 -0.5301  0.0297
 0.5149  0.9356 -0.1797
 0.4917  0.3282 -0.4619
[torch.FloatTensor of size 5x3]



### Operations on Variables

In [30]:
x = Variable(torch.Tensor(5, 3).uniform_(-1, 1))
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1))
z = torch.mm(x, y)
print(z.size())

torch.Size([5, 5])


## Define-by-run Paradigm

The torch autograd package provides automatic differentiation for all operations on Tensors.

PyTorch's autograd is a reverse mode automatic differentiation system.

Backprop is defined by how your code is run, so every single iteration can be different.

Other frameworks that adopt a similar approach :

1. Chainer - https://github.com/chainer/chainer
2. DyNet - https://github.com/clab/dynet
3. TensorFlow Eager

### How autograd encodes execution history


Conceptually, autograd maintains a graph that records every operation performed on Variables as your code executes. The result is a directed acyclic graph whose leaves are the input Variables and whose roots are the output Variables. By tracing this graph from roots to leaves, autograd computes the gradients automatically using the chain rule.

![caption](images/dynamic_graph.gif)

GIF source: https://github.com/pytorch/pytorch

Internally, autograd represents this graph as a graph of Function objects (really expressions), each of which can be applied via `apply()` to compute the result of evaluating the graph. During the forward pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the `.grad_fn` attribute of each Variable is an entry point into this graph). Once the forward pass is complete, this graph is evaluated in the backward pass to compute the gradients.
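
The recording-then-tracing idea can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions (scalars only, two operations), not PyTorch's actual implementation; `Node` and `backward` are hypothetical names invented for this sketch.

```python
class Node:
    """A value plus a record of how it was produced (toy stand-in for a Variable)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, local_derivative_of_self_wrt_parent).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(root):
    """Trace the recorded graph from the root to the leaves with the chain rule."""
    order, seen = [], set()

    def visit(node):
        # Post-order walk: parents are appended before the nodes that use them.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(root)
    root.grad = 1.0
    # Reversed post-order guarantees a node's gradient is complete
    # before it is pushed to its parents.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Node(2.0), Node(3.0)
z = x * y + x            # the graph is built while this line executes
backward(z)
print(x.grad, y.grad)    # dz/dx = y + 1, dz/dy = x
```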

In [35]:
x = Variable(torch.Tensor(5, 3).uniform_(-1, 1))
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1))
z = torch.mm(x, y)
print(z.grad_fn)

<torch.autograd.function.AddmmBackward object at 0x7ff1b19b6270>


An important thing to note is that the graph is recreated from scratch at every iteration. This is exactly what allows arbitrary Python control flow statements, which can change the overall shape and size of the graph at every iteration. You don’t have to encode all possible paths before you launch training - what you run is what you differentiate.
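
As a small illustration of control flow changing the graph, the sketch below runs a Python loop a different number of times on each iteration. The helper `forward` and its `n_steps` argument are hypothetical names introduced here; the point is only that the gradient follows whichever graph was actually executed.

```python
import torch
from torch.autograd import Variable


def forward(x, n_steps):
    # Hypothetical helper: multiplies by x once per loop step,
    # so the result is x ** (n_steps + 1).
    out = x
    for _ in range(n_steps):
        out = out * x
    return out


x = Variable(torch.ones(1) * 2.0, requires_grad=True)

grads = []
for n_steps in (1, 2, 3):
    out = forward(x, n_steps)   # a fresh graph is recorded on every call
    (g,) = torch.autograd.grad(out, x, grad_outputs=torch.ones(1))
    grads.append(float(g.data[0]))

# With x = 2: d(x^2)/dx = 4, d(x^3)/dx = 12, d(x^4)/dx = 32
print(grads)
```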

## Getting gradients : `backward()` & `torch.autograd.grad`

In [78]:
x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
y = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
z = x ** 2 + 3 * y
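# z is not a scalar, so backward() needs the gradient of the final result w.r.t. z.
# Passing ones gives the elementwise derivatives dz/dx = 2x and dz/dy = 3.
z.backward(gradient=torch.ones(5, 3))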

In [79]:
torch.eq(x.grad, 2 * x)

Variable containing:
 1  1  1
 1  1  1
 1  1  1
 1  1  1
 1  1  1
[torch.ByteTensor of size 5x3]

In [80]:
y.grad

Variable containing:
 3  3  3
 3  3  3
 3  3  3
 3  3  3
 3  3  3
[torch.FloatTensor of size 5x3]

In [85]:
x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
y = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
z = x ** 2 + 3 * y
dz_dx = torch.autograd.grad(z, x, grad_outputs=torch.ones(5, 3))
dz_dy = torch.autograd.grad(z, y, grad_outputs=torch.ones(5, 3))
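
The two entry points differ in where the result ends up: `backward()` accumulates into `.grad`, while `torch.autograd.grad` returns the gradients and leaves `.grad` untouched. The sketch below (helper name `max_abs_diff` is illustrative) requests both gradients in a single call by passing a list of inputs and compares them with the analytic values `2x` and `3`.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
y = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
z = x ** 2 + 3 * y

# One call, several inputs: the result is a tuple in the same order.
dz_dx, dz_dy = torch.autograd.grad(z, [x, y], grad_outputs=torch.ones(5, 3))


def max_abs_diff(a, b):
    # Illustrative helper: largest elementwise deviation between two tensors.
    return float((a - b).abs().max())


dx_err = max_abs_diff(dz_dx.data, 2 * x.data)
dy_err = max_abs_diff(dz_dy.data, 3 * torch.ones(5, 3))
print(dx_err, dy_err)

# Nothing was accumulated on the inputs themselves.
print(x.grad, y.grad)
```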

## Define-by-run example

### Common Variable definition

In [107]:
x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)
z = Variable(torch.Tensor(5,).uniform_(-1, 1), requires_grad=True)

### Graph 1 : `xy + z`

In [110]:
o = torch.mm(x, y) + z
do_dinputs_1 = torch.autograd.grad(o, [x, y, z], grad_outputs=torch.ones(5, 5))

In [111]:
print('Gradients of o w.r.t inputs in Graph 1')
print('do/dx : \n\n %s ' % (do_dinputs_1[0]))
print('do/dy : \n\n %s ' % (do_dinputs_1[1]))
print('do/dz : \n\n %s ' % (do_dinputs_1[2]))

Gradients of o w.r.t inputs in Graph 1
do/dx : 
 Variable containing:
 0.5620  1.1932  0.9028
 0.5620  1.1932  0.9028
 0.5620  1.1932  0.9028
 0.5620  1.1932  0.9028
 0.5620  1.1932  0.9028
[torch.FloatTensor of size 5x3]
 
do/dy : 
 Variable containing:
-0.5016 -0.5016 -0.5016 -0.5016 -0.5016
-0.3178 -0.3178 -0.3178 -0.3178 -0.3178
-0.2748 -0.2748 -0.2748 -0.2748 -0.2748
[torch.FloatTensor of size 3x5]
 
do/dz : 
 Variable containing:
 5
 5
 5
 5
 5
[torch.FloatTensor of size 5]
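
The pattern in these outputs (identical rows in do/dx, identical columns in do/dy, all fives in do/dz) follows from the chain rule. With `grad_outputs` equal to a 5x5 matrix of ones `G`, the gradients of `xy + z` are `G yᵀ`, `xᵀ G`, and the column sums of `G`. A minimal check under those assumptions, using freshly sampled inputs rather than the values printed above:

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)
z = Variable(torch.Tensor(5).uniform_(-1, 1), requires_grad=True)

G = torch.ones(5, 5)
o = torch.mm(x, y) + z              # z is broadcast across the rows
do_dx, do_dy, do_dz = torch.autograd.grad(o, [x, y, z], grad_outputs=G)

expected_dx = torch.mm(G, y.data.t())   # every row is the row sums of y
expected_dy = torch.mm(x.data.t(), G)   # every column is the column sums of x
expected_dz = G.sum(0)                  # one contribution per row: 5 each

err_dx = float((do_dx.data - expected_dx).abs().max())
err_dy = float((do_dy.data - expected_dy).abs().max())
err_dz = float((do_dz.data - expected_dz).abs().max())
print(err_dx, err_dy, err_dz)
```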
 


### Graph 2 : `xy / z`

In [112]:
o = torch.mm(x, y) / z
do_dinputs_2 = torch.autograd.grad(o, [x, y, z], grad_outputs=torch.ones(5, 5))

In [113]:
print('Gradients of o w.r.t inputs in Graph 2')
print('do/dx : \n %s ' % (do_dinputs_2[0]))
print('do/dy : \n %s ' % (do_dinputs_2[1]))
print('do/dz : \n %s ' % (do_dinputs_2[2]))

Gradients of o w.r.t inputs in Graph 2
do/dx : 
 Variable containing:
-1.5378  9.1510  8.7001
-1.5378  9.1510  8.7001
-1.5378  9.1510  8.7001
-1.5378  9.1510  8.7001
-1.5378  9.1510  8.7001
[torch.FloatTensor of size 5x3]
 
do/dy : 
 Variable containing:
-0.5047 -0.9068 -3.6041 -0.8424  3.3000
-0.3197 -0.5745 -2.2832 -0.5336  2.0905
-0.2765 -0.4968 -1.9745 -0.4615  1.8079
[torch.FloatTensor of size 3x5]
 
do/dz : 
 Variable containing:
  0.5015
 -0.1811
 14.0846
  1.1818
 -9.6384
[torch.FloatTensor of size 5]
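
Changing a single operation changes every gradient, because a different graph was recorded. For `xy / z` with a 5x5 matrix of ones `G` as `grad_outputs`, the chain rule gives `(G / z) yᵀ`, `xᵀ (G / z)`, and the column sums of `-(xy) / z²`. The sketch below checks this numerically. As an assumption made only for numerical stability, `z` is sampled from `[1, 2)` here instead of `(-1, 1)`, so the division never gets close to zero.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor(5, 3).uniform_(-1, 1), requires_grad=True)
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)
# Assumption for this sketch: keep z away from zero.
z = Variable(torch.Tensor(5).uniform_(1, 2), requires_grad=True)

G = torch.ones(5, 5)
o = torch.mm(x, y) / z              # z is broadcast across the rows
do_dx, do_dy, do_dz = torch.autograd.grad(o, [x, y, z], grad_outputs=G)

scaled = G / z.data                                  # column j holds 1 / z_j
expected_dx = torch.mm(scaled, y.data.t())
expected_dy = torch.mm(x.data.t(), scaled)
expected_dz = -(torch.mm(x.data, y.data) / z.data ** 2).sum(0)

err_dx = float((do_dx.data - expected_dx).abs().max())
err_dy = float((do_dy.data - expected_dy).abs().max())
err_dz = float((do_dz.data - expected_dz).abs().max())
print(err_dx, err_dy, err_dz)
```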
 


## Excluding subgraphs from backward : `requires_grad=False`, `volatile=True` & `.detach()`

### `requires_grad=False`

1. If even a single input to an operation requires gradient, its output will also require gradient.

2. Conversely, the output won’t require gradient only if none of the inputs require it.

3. Backward computation is never performed in subgraphs where no Variable requires gradients.

In [114]:
x = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=False)
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=False)
z = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)

In [119]:
o = x + y
print(o.requires_grad)
o = x + y + z
print(o.requires_grad)

False
True
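
The flag also decides which inputs receive a gradient after `backward()`. In the sketch below (the names `frozen` and `trainable` are illustrative), only the Variable created with `requires_grad=True` ends up with a populated `.grad`; the other one is skipped entirely.

```python
import torch
from torch.autograd import Variable

frozen = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=False)
trainable = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)

loss = (frozen * trainable).sum()
loss.backward()

# No gradient is computed or stored for the frozen input.
print(frozen.grad)

# d/d(trainable) of sum(frozen * trainable) is just `frozen`.
grad_matches = bool(torch.equal(trainable.grad.data, frozen.data))
print(grad_matches)
```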


### `volatile=True`

1. If even a single input to an operation is volatile, the result will have `grad_fn` set to `None` and `requires_grad` set to `False`, even if other inputs require gradients.

2. Conversely, the output can have a `grad_fn` only if none of the inputs are volatile.

In [120]:
x = Variable(torch.Tensor(3, 5).uniform_(-1, 1), volatile=True)
y = Variable(torch.Tensor(3, 5).uniform_(-1, 1), volatile=True)
z = Variable(torch.Tensor(3, 5).uniform_(-1, 1), requires_grad=True)

In [122]:
o = x + y
print(o.requires_grad)
print(o.grad_fn)
o = x + y + z
print(o.requires_grad)
print(o.grad_fn)

False
None
False
None
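
The `volatile` flag belongs to the PyTorch releases this tutorial was written against. Assuming PyTorch 0.4 or later, where `volatile` no longer has any effect, the closest equivalent is the `torch.no_grad()` context manager. A minimal sketch under that assumption, showing the same "no `grad_fn`, no `requires_grad`" behaviour:

```python
import torch

# Assumes PyTorch >= 0.4, where tensors carry requires_grad directly.
x = torch.ones(3, 5)
z = torch.ones(3, 5, requires_grad=True)

with torch.no_grad():
    o = x + z                 # nothing is recorded inside the block

tracked = x + z               # outside the block, history is recorded again

print(o.requires_grad, o.grad_fn)
print(tracked.requires_grad, tracked.grad_fn is not None)
```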
