In [1]:
%load_ext notexbook

In [2]:
%texify

# Episode 1: The Learning Process

We start our journey into **PyTorch** by immediately stating our _final goal_, that is:

**Understanding how the learning process of a Neural Network actually works**

<img src="./imgs/learning_process.png" alt="The learning process" class="maxw85" />

<span class="fn"><b>[1]:</b> Image from [Deep Learning with PyTorch, Luca Antiga et al.](https://www.manning.com/books/deep-learning-with-pytorch)</span>

I am sure you can appreciate that Neural Networks are indeed simple machines, in the sense that they are based on a _simple learning process_.
However, as it is customary to say:
> (Neural Networks)..are indeed that simple. The complexity comes with the details.

## Derivatives, Gradients, and Chain Rule for the Backward Propagation

The learning process of an `NN` is built on a very powerful algorithm (and technique) called **Backward Propagation**.

More specifically, Backward Propagation in an `NN` is better seen as a _side-effect_ of the optimisation problem we are trying to solve.

In optimisation and operations research, it is well known that we can use **gradients** to search for a _minimum_ (or a _maximum_) of a differentiable function $f(x), x \in \mathbb{R}^n$.
In particular, if we repeatedly move in the direction opposite to the gradient, we move towards a (local) minimum of the target function $f(x)$.
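As a minimal sketch of this idea, consider the one-dimensional function $f(x) = (x-3)^2$, whose gradient is $f'(x) = 2(x-3)$. The function, the starting point, and the step size below are illustrative choices, not part of the original material.

```python
def f(x):
    # Target function with its minimum at x = 3
    return (x - 3.0) ** 2


def grad_f(x):
    # Analytical derivative of f
    return 2.0 * (x - 3.0)


x = 0.0            # arbitrary starting point
step_size = 0.1    # illustrative step size (the "learning rate")

for _ in range(100):
    # Move against the gradient to decrease f
    x = x - step_size * grad_f(x)

print(x, f(x))
```

Each step shrinks the distance to the minimum by a constant factor, so `x` ends up very close to `3`.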

In `NN` terms, this $f(x)$ is the **error** (or **loss**) function that we want to minimise. In this sense, **Backward Prop** is a side-effect of the optimisation: it is the method we use to propagate the gradient of the loss back through the connected layers, so that every weight can be updated.

Therefore, the question now is: how do we calculate those gradients?

In addition, this calculation must satisfy two requirements:
- **computationally exact**;  
- **computationally efficient**.
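To see why both requirements matter, the sketch below compares an analytical gradient with a finite-difference estimate. Finite differences are approximate in general (truncation and rounding error) and need two function evaluations per input dimension, whereas the analytical gradient is obtained in a single pass. The function `f` and the helper names are illustrative assumptions.

```python
import numpy as np


def f(x):
    # Simple scalar-valued function of a vector: f(x) = sum(x_i^2)
    return np.sum(x ** 2)


def analytic_grad(x):
    # Exact gradient, computed in one pass: df/dx_i = 2 * x_i
    return 2.0 * x


def finite_difference_grad(x, eps=1e-6):
    # Central differences: two evaluations of f for every input dimension
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad


x = np.array([1.0, -2.0, 0.5])
exact = analytic_grad(x)
approx = finite_difference_grad(x)
max_abs_error = np.max(np.abs(exact - approx))
print(exact, approx, max_abs_error)
```

With a million weights, the finite-difference loop would need two million evaluations of the loss per update, which is why an exact and efficient alternative is required.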

**TL;DR**:

(doc. reference: [torch autograd](https://pytorch.org/docs/stable/autograd.html))

`torch.autograd` provides classes and functions implementing automatic differentiation of arbitrary scalar valued functions. 

It requires <u>minimal changes</u> to the existing code - you only need to declare `Tensor`s for which gradients should be computed with the `requires_grad=True` keyword.

### Warm-up: `numpy` ol' friend

Numpy does not know **anything** about computation graphs, or deep learning, or gradients. 

However we can easily use `numpy` to fit a two-layer network to random data by **manually** implementing the `forward` and `backward` passes through the network using numpy primitives:

In [3]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

In [4]:
learning_rate = 1e-6

In this example we will be using the **ReLU** activation function, defined as `relu(x) = max(0, x)`. This is mainly because it is well-known and popular, *it works*, and it is easy to differentiate (manually).
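A small sketch of ReLU and its manual derivative on a hand-picked vector (the values are illustrative). It uses the same `grad_h[h < 0] = 0` convention as the training loops in this notebook, which lets the gradient pass through at exactly `h == 0`.

```python
import numpy as np

h = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])

# Forward: relu(h) = max(0, h), applied element-wise
h_relu = np.maximum(h, 0)

# Backward: the local derivative is 0 where h < 0 and 1 where h > 0.
# ReLU is not differentiable at h == 0; here the upstream gradient is kept there.
upstream = np.ones_like(h)   # pretend gradient arriving from the next layer
grad_h = upstream.copy()
grad_h[h < 0] = 0

print(h_relu)
print(grad_h)
```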

In [5]:
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 50 == 0:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
print('Final Loss: ', loss)

0 34642001.6149787
50 14497.04812171034
100 444.62275798080344
150 21.01969414987922
200 1.2131356849532264
250 0.0793073093235837
300 0.0056875138215903955
350 0.00044111774856022215
400 3.668645922244574e-05
450 3.2488248058539225e-06
Final Loss:  3.1889407822091545e-07


### `torch.Tensor`

``torch.Tensor`` is the central class of the package. 

Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of **50x or greater** ([some benchmark here](https://github.com/jcjohnson/cnn-benchmarks)), so unfortunately `numpy` won’t be enough for modern deep learning.

Let's rewrite the previous example using `torch.Tensor` and PyTorch API to _manually_ implement the forward and backward passes through the network:

In [4]:
import torch

dtype = torch.float
device = torch.device("cpu") #if torch.backends.mps.is_available() else torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6

In [5]:
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)  # matrix product of x and w1
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 50 == 0:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
    
print('Final Loss: ', loss)

0 32728604.0
50 12603.54296875
100 402.66143798828125
150 20.295324325561523
200 1.1860454082489014
250 0.07474895566701889
300 0.005151806864887476
350 0.0005459957174025476
400 0.0001209118781844154
450 4.5200235035736114e-05
Final Loss:  2.3586468159919605e-05


In the above examples, we had to manually implement both the forward and backward passes of our neural network. 

Manually implementing the backward pass is **not** ideal for networks bigger than just two layers.

### `torch.autograd`

The `autograd` package in PyTorch addresses exactly this problem: it computes the backward pass automatically.

When using autograd, the forward pass of your network will define a **(dynamic) computational graph**;
_nodes_ in the graph will be _Tensors_, and _edges_ will be functions that produce output _Tensors_ from input _Tensors_. 

Backpropagating through this graph then allows us to easily compute gradients.

##### `torch.autograd` & `torch.Tensor`

If you set the ``.requires_grad`` attribute of a ``torch.Tensor`` to ``True``, autograd starts to track all operations on it.

When you finish your computation you can call ``.backward()`` and have all the
gradients computed automatically. 

The gradient for this tensor will be accumulated into its ``.grad`` attribute.
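The word *accumulated* is important: a second call to ``.backward()`` adds to ``.grad`` instead of overwriting it. A minimal sketch, with an illustrative tensor `w`:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(w^2))/dw = 2 * w = [2, 4]
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).sum().backward()
second = w.grad.clone()

# Reset the buffer in place, as a training loop does between iterations
w.grad.zero_()
print(first, second, w.grad)
```

This is why training loops zero the gradients once per iteration.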

To stop a tensor from tracking history, you can call ``.detach()`` to detach
it from the computation history, and to prevent future computation from being
tracked.

To prevent tracking history (and using memory), you can also wrap the code block
in ``with torch.no_grad():``. 

This can be particularly helpful when evaluating a model because the model may have trainable parameters with ``requires_grad=True``, but for which we don't need the gradients.
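The two mechanisms can be compared side by side. In this sketch (tensor names are illustrative), only the result computed normally keeps a link to the computation history:

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = w * 2               # recorded in the graph
detached = tracked.detach()   # same values, cut off from the history

with torch.no_grad():
    untracked = w * 2         # computed without recording anything

print(tracked.requires_grad, detached.requires_grad, untracked.requires_grad)
```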

There’s one more class which is very important for autograd implementation: a ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic
graph, that encodes a complete history of computation. 

Each tensor has a ``.grad_fn`` attribute that references a ``Function`` that has created
the ``Tensor`` (except for Tensors created by the user - their ``grad_fn is None``).

If you want to compute the derivatives, you can call ``.backward()`` on a ``Tensor``. 

If ``Tensor`` is a scalar, you don’t need to specify any arguments to ``backward()``,
however if it has more elements, you need to specify a ``gradient``
argument that is a tensor of matching shape.
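For a non-scalar output, the ``gradient`` argument plays the role of the vector $v$ in a vector-Jacobian product $v^T J$. A minimal sketch, with illustrative values for `x` and `v`:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # non-scalar output, one element per input

# Calling y.backward() with no argument would raise a RuntimeError here,
# because y is not a scalar. We pass a tensor of matching shape instead.
v = torch.tensor([1.0, 0.1, 0.01])
y.backward(gradient=v)

# Since dy_i/dx_i = 2 * x_i, the result is v * 2 * x
print(x.grad)
```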

---
<span id="fn1">Source: [PyTorch Autograd Tutorial](https://github.com/pytorch/tutorials/blob/master/beginner_source/blitz/autograd_tutorial.py)</span>


In [6]:
# Create a tensor requiring a gradient
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


In [7]:
# Let's use this tensor in a function
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [8]:
# y was created as a result of an operation, so it has a ``grad_fn``.
print(y.grad_fn)

<AddBackward0 object at 0x7fb1286ba230>


In [9]:
# Let's do more operations on y
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


###### `tensor.requires_grad_` method

``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad`` flag in-place. 

Its argument defaults to ``True`` if not given, whereas a newly created Tensor has ``requires_grad=False`` unless specified otherwise.

In [10]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)

a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x7fb1289889d0>


##### Gradients

Let's backprop now!

Because ``out`` contains a single scalar, ``out.backward()`` is equivalent to ``out.backward(torch.tensor(1.))``.

In [11]:
out.backward()

In [12]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


In [13]:
x.data  # `.data` property returns the data of the tensor

tensor([[1., 1.],
        [1., 1.]])

You should have got a matrix of ``4.5``.

Let’s call the ``out`` *Tensor* $o$.

We have that $o = \frac{1}{4}\sum_i z_i$, $z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.

Therefore, $\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i+2)$, 
hence
$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
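Spelling out the intermediate chain-rule step behind this result, with the same symbols as above:

$$
\frac{\partial o}{\partial x_i}
= \frac{\partial o}{\partial z_i}\,\frac{\partial z_i}{\partial x_i}
= \frac{1}{4}\cdot 6\,(x_i+2)
= \frac{3}{2}(x_i+2)
$$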

Algorithmically, here is how backpropagation happens with a computation graph. 

> (Not the actual implementation, only representative)

```python

def backward(self, incoming_gradient):
    # Store the gradient that reached this tensor
    self.Tensor.grad = incoming_gradient

    for inp in self.inputs:
        if inp.grad_fn is not None:
            # Chain rule: upstream gradient times the local gradient
            adjoint = incoming_gradient * local_grad(self.Tensor, inp)
            inp.grad_fn.backward(adjoint)
```
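To make the idea concrete, here is a tiny runnable scalar version in plain Python. The `Value` class is hypothetical and written only for illustration; it is not how PyTorch is implemented. It reproduces the expression $z = 3(x+2)^2$ used above for a single element.

```python
class Value:
    """Hypothetical scalar graph node, for illustration only."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # inputs of the op that produced this value
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so that a node's gradient is complete
        # before it is propagated to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local  # chain rule, accumulated


x = Value(1.0)
y = x + Value(2.0)        # y = x + 2
z = y * y * Value(3.0)    # z = 3 * (x + 2)^2
z.backward()
print(z.data, x.grad)
```

Here `x.grad` is $6(x+2) = 18$; dividing by the four elements averaged by `mean()` in the tensor example gives back `4.5`.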

If you're interested in understanding more about **Autograd** and Automatic Differentiation, there is a dedicated notebook in addendum: [Autograd](./addendum_autograd.ipynb)

---

#### Two-Layer Network (MLP) using (built-in) `torch.autograd`

So, now going back to our previous example, let's rewrite our code in order to exploit the integrated `autograd` engine to calculate the gradients, without having to manually write the **backward** pass through the network:

In [14]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

This time we're going to **require gradients** (i.e. `requires_grad=True`) for the weights:

In [15]:
# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [16]:
learning_rate = 1e-6
for t in range(500):
    # Forward pass: 
    # ============
    # compute predicted y using operations on Tensors;
    # these are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand!
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a scalar (0-dimensional) Tensor, i.e. of shape ()
    loss = (y_pred - y).pow(2).sum()
    if t % 50 == 0:
        print(t, loss.item())  # loss.item() gets the scalar value held in the loss.

    # Backward pass: 
    # =============
    # Use autograd to compute the backward pass. 
    # This call will compute the gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call `w1.grad` and `w2.grad` will be Tensors holding the gradient
    # of the loss with respect to `w1` and `w2`, respectively.
    loss.backward()

    # Parameters Update (still manual)
    # =================
    # Manually update weights using gradient descent. 
    # NOTE: 
    # Using torch.no_grad()
    # because weights have `requires_grad=True`, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # We will use torch.optim.SGD later for the optimisation step.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

0 30439720.0
50 11787.865234375
100 376.2894592285156
150 17.460474014282227
200 0.9067532420158386
250 0.04993513226509094
300 0.003069826867431402
350 0.0003613127919379622
400 9.254566248273477e-05
450 3.8368314562831074e-05


**Addendum**: [TensorFlow 1.x Static Graph](addendum_tf_static_graph.ipynb) (Notes: Requires `TF2.x` installed)

---

### Defining new `autograd` functions

Under the hood, each primitive autograd operator is really two functions that operate on Tensors. 

The **forward** function computes output Tensors from input Tensors. The **backward** function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.

In PyTorch we can easily define our own autograd operator by defining a subclass of 
`torch.autograd.Function` and implementing the forward and backward functions. 

We can then use our new `autograd` operator by calling its `apply` method like a function, passing Tensors containing input data.

##### `torch.autograd.Function` (see [doc](https://pytorch.org/docs/stable/autograd.html#function))

Records operation history and defines formulas for differentiating ops.

Every operation performed on `Tensor`s creates a new function object, that performs the computation, and records that it happened. 

The history is retained in the form of a DAG of functions, with edges denoting data dependencies 
`(input <- output)`. Then, when backward is called, the graph is processed in the **topological ordering**, by calling `backward()` methods of each `Function` object, and passing returned gradients on to next `Function`s.

Normally, the only way users interact with functions is by creating subclasses and defining new operations. 

**This is a recommended way of extending `torch.autograd`**.

Example:

```python
class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result
    
    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        return grad_output * result

# Use it by calling the apply method:
input = torch.randn(3, requires_grad=True)  # illustrative input tensor
output = Exp.apply(input)
```
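One way to gain confidence in a hand-written `backward` is to compare it against finite differences with `torch.autograd.gradcheck`. The sketch below repeats the `Exp` class so that it is self-contained; the input size and tolerances are illustrative choices. Note that `gradcheck` expects double-precision inputs.

```python
import torch


class Exp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i.exp()
        ctx.save_for_backward(result)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        result, = ctx.saved_tensors
        # d(exp(i))/di = exp(i), which is the saved forward result
        return grad_output * result


# Double precision keeps the finite-difference estimate accurate enough
inp = torch.randn(4, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Exp.apply, (inp,), eps=1e-6, atol=1e-4)
print(ok)
```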

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

<span id="fn1">Source: [PyTorch Tensors and Autograd](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#pytorch-tensors-and-autograd)</span>
    

In [17]:
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache Tensors
        for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [18]:
dtype = torch.float
device = torch.device("cpu")
# device = torch.device("cuda:0") # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6

In [19]:
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 467.94873046875
199 1.5108758211135864
299 0.009017663076519966
399 0.0002465758007019758
499 4.527672354015522e-05


### `torch.nn`

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which 
have learnable parameters which will be optimized during learning.

In TensorFlow, packages like **Keras** (and the older **TensorFlow-Slim** and **TFLearn**) provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the `nn` package serves this same purpose. 

The `nn` package defines a set of `Module`s, which are roughly equivalent to neural network layers. 

A `Module` receives input `Tensor`s and computes output `Tensor`s, but may also hold internal state such as `Tensor`s containing learnable parameters. 

The `nn` package also defines a set of useful `loss` functions that are commonly used when 
training neural networks.

In this example we use the `nn` package to implement our two-layer network:

In [20]:
import torch

In [21]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

In [22]:
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. With
# reduction='sum' the squared errors are summed instead of averaged.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4

In [23]:
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 666.2017211914062
50 30.090824127197266
100 1.7846870422363281
150 0.18202653527259827
200 0.024471638724207878
250 0.0038833986036479473
300 0.0006893288227729499
350 0.00013127917191013694
400 2.6404813979752362e-05
450 5.515003977052402e-06


---

### `torch.optim`

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (**using `torch.no_grad()` or `.data` to avoid tracking history in autograd**). 

This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like `AdaGrad`, `RMSProp`, 
`Adam`.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
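To connect this with the manual updates used so far, the sketch below checks that one step of plain `torch.optim.SGD` (no momentum) produces the same result as the hand-written `w -= learning_rate * w.grad`. The tensors and the toy loss are illustrative.

```python
import torch

torch.manual_seed(0)
w_manual = torch.randn(3, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_(True)
lr = 0.1

optimizer = torch.optim.SGD([w_optim], lr=lr)

# Same toy loss on both copies of the weights
(w_manual ** 2).sum().backward()
(w_optim ** 2).sum().backward()

# Manual update, without tracking it in autograd
with torch.no_grad():
    w_manual -= lr * w_manual.grad
w_manual.grad.zero_()

# Equivalent update delegated to the optimizer
optimizer.step()
optimizer.zero_grad()

print(torch.allclose(w_manual, w_optim))
```

More sophisticated optimizers keep extra state per parameter, which is exactly the bookkeeping the `optim` package hides.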

Let's finally modify the previous example in order to use `torch.optim` and the `Adam` algorithm:

In [24]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

In [25]:
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [26]:
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 689.369873046875
50 203.2606201171875
100 48.96010971069336
150 7.991568565368652
200 1.2245500087738037
250 0.211031973361969
300 0.033494919538497925
350 0.00429914565756917
400 0.0004554758488666266
450 4.1447918192716315e-05


##### Model and Optimiser (w/ Parameters) at a glance

![model_and_optimiser](imgs/model_optim.png)

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_</span>

### Can we do better ?

---

##### The Learning Process

![learning process sketch](./imgs/learning_process.png)

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

Possible scenarios:

- Specify models that are more complex than a sequence of existing (pre-defined) modules;
- Customise the learning procedure (e.g. _weight sharing_ ?)
- ?

For these cases, **PyTorch** allows us to define our own custom modules by subclassing `nn.Module` and defining a `forward` method which receives the input data (i.e. `Tensor`) and returns the output (i.e. `Tensor`).

It is in the `forward` method that **all** the _magic_ of the Dynamic Graph and `autograd` operations happens!
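As a sketch of what this flexibility buys, the hypothetical module below reuses the *same* hidden layer several times inside `forward` (a simple form of weight sharing) using ordinary Python control flow. The class name, layer sizes, and `n_repeats` parameter are assumptions made for illustration.

```python
import torch


class SharedLayerNet(torch.nn.Module):
    """Hypothetical example: one hidden layer applied `n_repeats` times."""

    def __init__(self, D_in, H, D_out, n_repeats=2):
        super().__init__()
        self.input_layer = torch.nn.Linear(D_in, H)
        self.shared = torch.nn.Linear(H, H)  # reused at every repetition
        self.output_layer = torch.nn.Linear(H, D_out)
        self.n_repeats = n_repeats

    def forward(self, x):
        h = self.input_layer(x).clamp(min=0)
        # Plain Python loop: the graph is built anew at every call
        for _ in range(self.n_repeats):
            h = self.shared(h).clamp(min=0)
        return self.output_layer(h)


model = SharedLayerNet(D_in=8, H=4, D_out=2, n_repeats=3)
out = model(torch.randn(5, 8))
out.sum().backward()

# The shared layer is counted once, however many times it is applied
n_params = sum(p.numel() for p in model.parameters())
print(out.shape, n_params)
```

Autograd sums the gradient contributions from every use of `self.shared`, so no extra code is needed for the backward pass.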

### PyTorch: Custom Modules

Let's implement our **two-layer** model as a custom `nn.Module` subclass:

In [27]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.hidden_activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        l1 = self.linear1(x)
        h_relu = self.hidden_activation(l1)
        y_pred = self.linear2(h_relu)
        return y_pred

In [28]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [29]:
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

In [30]:
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

In [31]:
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 702.1043701171875
50 35.65228271484375
100 2.995074510574341
150 0.3680258095264435
200 0.05210673436522484
250 0.008305832743644714
300 0.0014585581375285983
350 0.0002786656259559095
400 5.7221630413550884e-05
450 1.249218712473521e-05


#### What happened really? Let's have a closer look

```python
>>> model = TwoLayerNet(D_in, H, D_out)
```

This calls the `TwoLayerNet.__init__` **constructor** method (_implementation reported below_ ):

```python
def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.hidden_activation = torch.nn.ReLU()
    self.linear2 = torch.nn.Linear(H, D_out)
```

1. First thing, we call the `nn.Module` constructor, which sets up the housekeeping
    - If you forget to do that, you will get an error message reminding you to call it before using any `nn.Module` capabilities
2. We create an instance attribute for each layer (`OP/Tensor/`) that we intend to include in our model
    - These can also be `Sequential` containers, as in _Submodules_ or *Blocks of Layers*
    - **Note**: We are **not** defining the Graph yet, just the layers!

```python
>>> y_pred = model(x)
```

1. First thing to notice: the `model` object is **callable**
   - It means `nn.Module` is implementing a `__call__` method
   - We **don't** need to re-implement that!
   
2. (in fact) The `nn.Module` class will call `self.forward` - in a [Template Method Pattern](https://en.wikipedia.org/wiki/Template_method_pattern) fashion
    - for this reason, we have to define the `forward` method
    - (needless to say) the `forward` method implements the **forward** pass of our model

From `torch/nn/modules/module.py` (excerpt; the exact implementation differs between PyTorch versions):

```python 
class Module(object):
    # [...] omissis
    def __call__(self, *input, **kwargs):
        for hook in self._forward_pre_hooks.values():
            result = hook(self, input)
            if result is not None:
                if not isinstance(result, tuple):
                    result = (result,)
                input = result
        if torch._C._get_tracing_state():
            result = self._slow_forward(*input, **kwargs)
        else:
            result = self.forward(*input, **kwargs)
        for hook in self._forward_hooks.values():
            hook_result = hook(self, input, result)
            if hook_result is not None:
                result = hook_result
        if len(self._backward_hooks) > 0:
            var = result
            while not isinstance(var, torch.Tensor):
                if isinstance(var, dict):
                    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
                else:
                    var = var[0]
            grad_fn = var.grad_fn
            if grad_fn is not None:
                for hook in self._backward_hooks.values():
                    wrapper = functools.partial(hook, self)
                    functools.update_wrapper(wrapper, hook)
                    grad_fn.register_hook(wrapper)
        return result
    
    # [...] omissis
    def forward(self, *input):
        r"""Defines the computation performed at every call.

        Should be overridden by all subclasses.

        .. note::
            Although the recipe for forward pass needs to be defined within
            this function, one should call the :class:`Module` instance afterwards
            instead of this since the former takes care of running the
            registered hooks while the latter silently ignores them.
        """
        raise NotImplementedError
```

**Take away messages** :
1. We don't need to implement the `__call__` method at all in our custom model subclass
2. We don't need to call the `forward` method directly. 
    - We could, but we would miss the flexibility of _forward_ and _backward_ hooks
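The difference can be observed directly with a forward hook. In this sketch (the hook function and the list `calls` are illustrative), the hook fires when the module is called, but not when `forward` is invoked directly:

```python
import torch

calls = []


def record_shape(module, inputs, output):
    # A forward hook receives the module, its positional inputs (a tuple)
    # and the output produced by forward
    calls.append(tuple(output.shape))


layer = torch.nn.Linear(4, 2)
handle = layer.register_forward_hook(record_shape)

x = torch.randn(3, 4)
layer(x)           # goes through __call__: the hook runs
layer.forward(x)   # bypasses __call__: the hook does not run
handle.remove()    # unregister the hook when it is no longer needed

print(calls)
```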

##### Last but not least

```python
>>> optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

Because `model` is an instance of an `nn.Module` subclass, `model.parameters()` automatically collects all the parameters registered by the module and its submodules. These are the tensors for which `autograd` computes gradients during the *backward* step, and which the optimiser then updates.

###### `model.named_parameters`

In [32]:
for name_str, param in model.named_parameters():
    print("{:21} {:19} {}".format(name_str, str(param.shape), param.numel()))

linear1.weight        torch.Size([100, 1000]) 100000
linear1.bias          torch.Size([100])   100
linear2.weight        torch.Size([10, 100]) 1000
linear2.bias          torch.Size([10])    10


**WAIT**: What happened to `hidden_activation` ?

```python
self.hidden_activation = torch.nn.ReLU()
```

So, it looks like we are registering in the constructor a submodule (`torch.nn.ReLU`) that has no parameters.

Generalising, if we had **more** (hidden) layers, we would need to define one of these submodules for each pair of layers (at least).

Looking back at the implementation of the `TwoLayerNet` class as a whole, it looks like a bit of a waste.

**Can we do any better here?** 🤔

---

Well, in this particular case, we could implement the `ReLU` activation _manually_. It is not that difficult, is it?

$\rightarrow$ As we already did before, we could use the [`torch.clamp`](https://pytorch.org/docs/stable/torch.html?highlight=clamp#torch.clamp) function

> `torch.clamp`: Clamp all elements in input into the range [ min, max ] and return a resulting tensor

`t.clamp(min=0)` is **exactly** the ReLU that we want.
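
A quick sanity check of this claim, comparing forward values only (the sample tensor is an arbitrary choice made for this sketch):

```python
import torch

t = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

clamped = t.clamp(min=0)    # negative entries are raised to 0, the rest is unchanged
rectified = torch.relu(t)   # built-in ReLU, for comparison

print(clamped)
print(torch.equal(clamped, rectified))
```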

In [33]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

###### Sorted!

That was easy, wasn't it? **However**, what if we wanted *other* activation functions (e.g. `tanh`, 
`sigmoid`, `LeakyReLU`)?

### Introducing the Functional API

PyTorch has functional counterparts for most `nn` modules. 

By _functional_ here we mean "having no internal state", or, in other words, "whose output value is solely and fully determined by the values of the input arguments". 

Indeed, `torch.nn.functional` provides many of the same operations we find in `nn`, but with any parameters passed as arguments to the function call. 

For instance, the functional counterpart of `nn.Linear` is `nn.functional.linear`, which is a function that has signature `linear(input, weight, bias=None)`. 

The `weight` and `bias` parameters are **arguments** to the function.
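
The following sketch illustrates the correspondence: we let an `nn.Linear` module own the parameters, and then pass those same tensors explicitly to `nn.functional.linear`. The sizes are arbitrary values picked for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Linear(5, 3)   # the module owns (and registers) weight and bias
x_demo = torch.randn(4, 5)

out_module = layer(x_demo)                                   # state lives inside the module
out_functional = F.linear(x_demo, layer.weight, layer.bias)  # state is passed in explicitly

print(torch.allclose(out_module, out_functional))
```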

Back to our `TwoLayerNet` model, it makes sense to keep using nn modules for `nn.Linear`, so that our model will be able to manage all of its `Parameter` instances during training. 

However, we can safely switch to the functional counterpart of `nn.ReLU`, since it has no parameters.

In [34]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = torch.nn.functional.relu(self.linear1(x))  # torch.relu would do as well
        y_pred = self.linear2(h_relu)
        return y_pred
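
Going one step further, the question about *other* activation functions could be addressed by passing the activation as a plain callable. This is only a sketch under that assumption: `FlexibleTwoLayerNet` and its `activation` argument are hypothetical and not part of the original example.

```python
import torch
import torch.nn.functional as F

class FlexibleTwoLayerNet(torch.nn.Module):  # hypothetical variant of TwoLayerNet
    def __init__(self, D_in, H, D_out, activation=F.relu):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        # A plain function is stored as a regular attribute: nothing is registered.
        self.activation = activation

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))

x_demo = torch.randn(8, 20)
for act in (F.relu, torch.tanh, torch.sigmoid, F.leaky_relu):
    net = FlexibleTwoLayerNet(20, 16, 4, activation=act)
    print(tuple(net(x_demo).shape), len(list(net.parameters())))
```

Whatever the activation, the model still exposes the same four parameter tensors, since only the two linear layers carry state.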

In [36]:
model = TwoLayerNet(D_in, H, D_out)
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    y_pred = model(x)
    loss = criterion(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 708.2454223632812
50 31.361560821533203
100 1.842212438583374
150 0.19696930050849915
200 0.03158378601074219
250 0.006163737736642361
300 0.0013595600612461567
350 0.00031811968074180186
400 7.707237091381103e-05
450 1.90912360267248e-05


$\rightarrow$ For the curious minds: [The difference and connection between torch.nn and torch.nn.function from relu's various implementations](https://programmer.group/5d5a404b257d7.html)

#### Clever advice and Rule of thumb

> With **quantization**, stateless bits like activations suddenly become stateful because information on the quantization needs to be captured. This means that if we aim to quantize our model, it might be worthwhile to stick with the modular API if we go for non-JITed quantization. There is one style matter that will help you avoid surprises with (originally unforeseen) uses: if you need several applications of stateless modules (like `nn.HardTanh` or `nn.ReLU`), it is likely a good idea to have a separate instance for each. Re-using the same module appears to be clever and will give correct results with our standard Python usage here, but tools analysing your model might trip over it.

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>
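
A minimal sketch of what following this advice might look like with the modular API, using one `nn.ReLU` instance per application. The `ThreeLayerNet` class and the `act1`/`act2` names are hypothetical, chosen only for this illustration.

```python
import torch

class ThreeLayerNet(torch.nn.Module):  # hypothetical model, for illustration only
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.act1 = torch.nn.ReLU()   # one instance per application...
        self.linear2 = torch.nn.Linear(H, H)
        self.act2 = torch.nn.ReLU()   # ...rather than reusing self.act1
        self.linear3 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h = self.act1(self.linear1(x))
        h = self.act2(self.linear2(h))
        return self.linear3(h)

net = ThreeLayerNet(20, 16, 4)
print([name for name, _ in net.named_children()])
```

Each activation now shows up as its own named submodule, so a tool that inspects the module tree sees one entry per application.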

### Custom Graph flow: Example of Weight Sharing

As we already discussed, defining a custom `nn.Module` in PyTorch requires declaring the layers (i.e. the parameters) in the constructor (`__init__`) and implementing the `forward` method, where the sequence of calls to those layers defines the dynamic graph at each forward pass.

As an example of **dynamic graphs** we are going to implement a scenario in which we require parameter (i.e. _weight_) sharing between layers.

To do so, we will implement a very odd model: a fully-connected ReLU network that, on each `forward` call, chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers. In the code below, the input layer is always applied, and the middle layer is applied `random.randint(0, 3)` additional times.

_Weight sharing_ among the innermost layers is obtained by simply reusing the same `Module` multiple times when defining the forward pass.
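
Before looking at the full model, here is a small sketch of what reusing a `Module` implies for the gradients: the shared weights receive the *sum* of the contributions from every application. To check this, we compare against two independent copies holding the same values (the names `shared`, `first` and `second` are introduced only for this illustration).

```python
import copy
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(3, 3)
# Two independent copies with identical initial values, for comparison.
first, second = copy.deepcopy(shared), copy.deepcopy(shared)
x_demo = torch.randn(2, 3)

# Same module applied twice: a single weight tensor is used at both depths.
shared(torch.relu(shared(x_demo))).sum().backward()
# Same computation, but each depth has its own copy of the weights.
second(torch.relu(first(x_demo))).sum().backward()

summed = first.weight.grad + second.weight.grad
print(len(list(shared.parameters())))              # still one weight and one bias
print(torch.allclose(shared.weight.grad, summed))  # shared gradient vs. sum over both uses
```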

In [37]:
import torch

In [38]:
import random


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = torch.relu(self.input_linear(x))
        hidden_layers = random.randint(0, 3)
        for _ in range(hidden_layers):
            h_relu = torch.relu(self.middle_linear(h_relu))
        y_pred = self.output_linear(h_relu)
        return y_pred

In [39]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

In [40]:
for t in range(500):
    for i in range(2):
        # Split the data into two mini-batches of N // 2 samples each.
        # Use new names for the slices: reassigning x and y here would shrink
        # the data on the first iteration and yield empty batches afterwards
        # (whose summed loss is always 0.0).
        start, end = (N // 2) * i, (N // 2) * (i + 1)
        x_batch = x[start:end, ...]
        y_batch = y[start:end, ...]
        # Forward pass: Compute predicted y by passing x_batch to the model
        y_pred = model(x_batch)

        # Compute and print loss
        loss = criterion(y_pred, y_batch)
        if t % 50 == 0:
            print(t, loss.item())

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


---

### References and Further Reading:

1. [(*Paper*) PyTorch: An Imperative Style, High-Performance Deep Learning Library](https://papers.nips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf)
2. [PyTorch Autograd Function](https://pytorch.org/docs/stable/autograd.html#function)
3. [(**Terrific**) PyTorch Examples Repo](https://github.com/jcjohnson/pytorch-examples) (*from which most of the examples in this section have been taken*)