In [1]:
%load_ext notexbook

In [2]:
%texify

# PyTorch `nn` package

### `torch.nn`

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks, raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which 
have learnable parameters which will be optimized during learning.

In TensorFlow, packages like **Keras** (and the older **TensorFlow-Slim** and **TFLearn**) provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the `nn` package serves this same purpose. 

The `nn` package defines a set of `Module`s, which are roughly equivalent to neural network layers. 

A `Module` receives input `Tensor`s and computes output `Tensor`s, but may also hold internal state such as `Tensor`s containing learnable parameters. 
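
As a small illustration of that internal state, the sketch below builds a single `nn.Linear` and lists what it holds. The sizes (4 inputs, 3 outputs) and the variable names are arbitrary choices made for this example.

```python
import torch

# A Linear layer mapping 4 input features to 3 output features
# (sizes chosen arbitrarily for illustration).
layer = torch.nn.Linear(4, 3)

# The learnable state lives in Parameter objects: Tensors registered
# with the Module and flagged with requires_grad=True.
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# Calling the Module on a batch of 2 inputs returns a batch of 2 outputs.
out = layer(torch.randn(2, 4))
print(tuple(out.shape))
```

Note that the weight is stored with shape `(out_features, in_features)`, here `(3, 4)`.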

The `nn` package also defines a set of useful `loss` functions that are commonly used when 
training neural networks.
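
The training code in this section builds its loss with `torch.nn.MSELoss(reduction='sum')`. A minimal sketch of what that computes, using made-up random tensors (the shapes and the seed are assumptions of this example only):

```python
import torch

torch.manual_seed(0)  # fixed seed only to make the sketch repeatable
y_pred = torch.randn(5, 2)
y_true = torch.randn(5, 2)

# reduction='sum' adds up the squared errors over every element ...
loss_sum = torch.nn.MSELoss(reduction='sum')(y_pred, y_true)
# ... which is the same quantity as writing the sum of squares by hand.
manual_sum = (y_pred - y_true).pow(2).sum()

# The default, reduction='mean', divides that sum by the number of elements.
loss_mean = torch.nn.MSELoss()(y_pred, y_true)

print(loss_sum.item(), manual_sum.item(), loss_mean.item())
```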

In this example we use the `nn` package to implement our two-layer network:

In [1]:
import torch

In [2]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H), # xW+b
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

In [3]:
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4

In [5]:
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 639.6167602539062
50 30.772613525390625
100 2.5433220863342285
150 0.4780966639518738
200 0.12993700802326202
250 0.04030954837799072
300 0.013411465100944042
350 0.004547220654785633
400 0.0015493093524128199
450 0.000528861943166703


---

### `torch.optim`

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (**using `torch.no_grad()` or `.data` to avoid tracking history in autograd**). 

This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like `AdaGrad`, `RMSProp`, or `Adam`.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
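
To make the abstraction concrete, here is a small sketch comparing one step of `torch.optim.SGD` (without momentum) against the manual update used in the earlier loop. The toy loss, the learning rate, and the variable names are assumptions of this example.

```python
import torch

torch.manual_seed(0)
lr = 0.1  # arbitrary learning rate for the illustration

# One "parameter" tensor and an independent copy for the manual update.
w_optim = torch.randn(3, requires_grad=True)
w_manual = w_optim.detach().clone().requires_grad_(True)

optimizer = torch.optim.SGD([w_optim], lr=lr)

# Same toy loss on both copies: sum of squares.
(w_optim ** 2).sum().backward()
(w_manual ** 2).sum().backward()

# Update via the optimizer ...
optimizer.step()
# ... and by hand, as in the earlier training loop.
with torch.no_grad():
    w_manual -= lr * w_manual.grad

print(torch.allclose(w_optim, w_manual))
```

Both copies end up with the same values; the optimizer object simply packages the update rule together with the list of tensors it applies to.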

Let's finally modify the previous example in order to use `torch.optim` and the `Adam` algorithm:

In [6]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

In [7]:
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [8]:
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 668.5765380859375
50 201.50619506835938
100 49.221805572509766
150 6.844631195068359
200 0.6239314079284668
250 0.0545228011906147
300 0.005567540414631367
350 0.0005590067594312131
400 4.745638943859376e-05
450 3.2170055419555865e-06
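
The comment about gradient accumulation in the loop above can be checked in isolation. The following is a minimal sketch on a hand-made tensor `w` (a hypothetical stand-in for a model parameter), not part of the training code:

```python
import torch

w = torch.ones(2, requires_grad=True)

# First backward pass: d(sum(3*w))/dw = 3 for each element.
(3 * w).sum().backward()
grad_after_first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added
# to the one already stored in w.grad.
(3 * w).sum().backward()
grad_after_second = w.grad.clone()

# Clearing the stored gradient by hand (the job zero_grad() does for
# every parameter) gives a fresh value again.
w.grad = None
(3 * w).sum().backward()
grad_after_reset = w.grad.clone()

print(grad_after_first, grad_after_second, grad_after_reset)
```

The stored gradient goes from 3 to 6 on the second call and returns to 3 once it has been cleared.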


##### Model and Optimiser (w/ Parameters) at a glance

![model_and_optimiser](imgs/model_optim.png)

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

### Can we do better ?

---

##### The Learning Process

![learning process sketch](./imgs/learning_process.png)

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

Possible scenarios:

- Specify models that are more complex than a sequence of existing (pre-defined) modules;
- Customise the learning procedure (e.g. _weight sharing_ ?)
- ?

For these cases, **PyTorch** allows us to define our own custom modules by subclassing `nn.Module` and defining a `forward` method, which receives the input data (i.e. `Tensor`) and returns the output (i.e. `Tensor`).

It is in the `forward` method that **all** the _magic_ of Dynamic Graph and `autograd` operations happens!

### PyTorch: Custom Modules

Let's implement our **two-layer** model as a custom `nn.Module` subclass:

In [9]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.hidden_activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        l1 = self.linear1(x)
        h_relu = self.hidden_activation(l1)
        y_pred = self.linear2(h_relu)
        return y_pred

In [10]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [11]:
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

In [12]:
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

In [13]:
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 693.8279418945312
50 33.77482986450195
100 2.100647449493408
150 0.24654023349285126
200 0.04137988016009331
250 0.008377175778150558
300 0.0019239461980760098
350 0.0004813372506760061
400 0.00012718062498606741
450 3.48151006619446e-05


#### What happened really? Let's have a closer look

```python
>>> model = TwoLayerNet(D_in, H, D_out)
```

This calls the `TwoLayerNet.__init__` **constructor** method (_implementation reported below_):

```python
def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.hidden_activation = torch.nn.ReLU()
    self.linear2 = torch.nn.Linear(H, D_out)
```

1. First thing, we call the `nn.Module` constructor which sets up the housekeeping
    - If you forget to do that, you will get an error message reminding you to call it before using any `nn.Module` capabilities
2. We create an instance attribute for each layer (`OP/Tensor`) that we intend to include in our model
    - These can also be `Sequential` containers, used as _Submodules_ or *Blocks of Layers*
    - **Note**: We are **not** defining the Graph yet, just the layer!
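
As a sketch of the `Sequential`-as-submodule idea from the list above, the hypothetical `BlockNet` below registers a block of layers under a single attribute. The class name, attribute names, and sizes are invented for this illustration.

```python
import torch


class BlockNet(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_in, h, d_out):
        super().__init__()
        # A whole block of layers registered as a single attribute.
        self.hidden = torch.nn.Sequential(
            torch.nn.Linear(d_in, h),
            torch.nn.ReLU(),
        )
        self.output = torch.nn.Linear(h, d_out)

    def forward(self, x):
        return self.output(self.hidden(x))


net = BlockNet(8, 4, 2)
# Parameters of nested submodules are found recursively; their names are
# prefixed with the attribute name and the index inside the Sequential.
names = [name for name, _ in net.named_parameters()]
print(names)
```

The printed names are `hidden.0.weight`, `hidden.0.bias`, `output.weight` and `output.bias`: the `ReLU` at index 1 contributes nothing because it has no parameters.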

```python
>>> y_pred = model(x)
```

1. First thing to notice: the `model` object is **callable**
   - It means `nn.Module` is implementing a `__call__` method
   - We **don't** need to re-implement that!
   
2. (in fact) The `nn.Module` class will call `self.forward` - in a [Template Method Pattern](https://en.wikipedia.org/wiki/Template_method_pattern) fashion
    - for this reason, we have to define the `forward` method
    - (needless to say) the `forward` method implements the **forward** pass of our model

From `torch/nn/modules/module.py`:

```python 
class Module(object):
    # [...] omitted
    def __call__(self, *input, **kwargs):
        for hook in self._forward_pre_hooks.values():
            result = hook(self, input)
            if result is not None:
                if not isinstance(result, tuple):
                    result = (result,)
                input = result
        if torch._C._get_tracing_state():
            result = self._slow_forward(*input, **kwargs)
        else:
            result = self.forward(*input, **kwargs)
        for hook in self._forward_hooks.values():
            hook_result = hook(self, input, result)
            if hook_result is not None:
                result = hook_result
        if len(self._backward_hooks) > 0:
            var = result
            while not isinstance(var, torch.Tensor):
                if isinstance(var, dict):
                    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
                else:
                    var = var[0]
            grad_fn = var.grad_fn
            if grad_fn is not None:
                for hook in self._backward_hooks.values():
                    wrapper = functools.partial(hook, self)
                    functools.update_wrapper(wrapper, hook)
                    grad_fn.register_hook(wrapper)
        return result
    
    # [...] omitted
    def forward(self, *input):
        r"""Defines the computation performed at every call.

        Should be overridden by all subclasses.

        .. note::
            Although the recipe for forward pass needs to be defined within
            this function, one should call the :class:`Module` instance afterwards
            instead of this since the former takes care of running the
            registered hooks while the latter silently ignores them.
        """
        raise NotImplementedError
```

**Take away messages** :
1. We don't need to implement the `__call__` method at all in our custom model subclass
2. We don't need to call the `forward` method directly. 
    - We could, but we would miss the flexibility of _forward_ and _backward_ hooks
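
The second take-away can be observed directly with a forward hook. This is a minimal sketch on a standalone `nn.Linear`; the hook function `record_shape` and the `calls` list are names made up for the example.

```python
import torch

layer = torch.nn.Linear(3, 2)
calls = []  # records every time the hook fires


def record_shape(module, inputs, output):
    # Forward hooks receive the module, a tuple of positional inputs,
    # and the output produced by forward().
    calls.append(tuple(output.shape))


handle = layer.register_forward_hook(record_shape)
x = torch.randn(5, 3)

layer(x)            # goes through __call__, so the hook runs
layer.forward(x)    # bypasses __call__, so the hook is skipped

handle.remove()     # detach the hook when it is no longer needed
print(calls)
```

Only one entry is recorded, `(5, 2)`, even though the forward computation ran twice.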

##### Last but not least

```python
>>> optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

Since `model` is an instance of a subclass of `nn.Module`, `model.parameters()` automatically collects all the `Parameter` tensors registered by its layers, i.e. those that require gradient computation. The `autograd` engine fills in their gradients during the *backward* pass, and the optimizer uses those gradients in its update step.

###### `model.named_parameters`

In [14]:
for name_str, param in model.named_parameters():
    print("{:21} {:19} {}".format(name_str, str(param.shape), param.numel()))

linear1.weight        torch.Size([100, 1000]) 100000
linear1.bias          torch.Size([100])   100
linear2.weight        torch.Size([10, 100]) 1000
linear2.bias          torch.Size([10])    10


**WAIT**: What happened to `hidden_activation` ?

```python
self.hidden_activation = torch.nn.ReLU()
```

So, it looks like we are registering in the constructor a submodule (`torch.nn.ReLU`) that has no parameters.
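
A quick way to confirm this observation, sketched on a bare `nn.Module` used purely as a container (the `holder` object is an assumption of this example, not part of the model above):

```python
import torch

relu = torch.nn.ReLU()

# A ReLU module owns no learnable tensors at all.
n_relu_params = sum(p.numel() for p in relu.parameters())

# It is still registered as a child when assigned as an attribute.
holder = torch.nn.Module()
holder.hidden_activation = relu
child_names = [name for name, _ in holder.named_children()]

print(n_relu_params, child_names)
```

So the submodule is tracked by its parent, but it adds nothing to `named_parameters()`.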

Generalising, if we had **more** (hidden) layers, we would need to define one of these submodules for each pair of layers (at least).

Looking back at the implementation of the `TwoLayerNet` class as a whole, it looks like a bit of a waste.

**Can we do any better here?** 🤔

---

Well, in this particular case, we could implement the `ReLU` activation _manually_. It is not that difficult, is it?

$\rightarrow$ As we already did before, we could use the [`torch.clamp`](https://pytorch.org/docs/stable/torch.html?highlight=clamp#torch.clamp) function

> `torch.clamp`: Clamp all elements in input into the range [ min, max ] and return a resulting tensor

`t.clamp(min=0)` is **exactly** the ReLU that we want.
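
A short check of that equivalence on a hand-picked tensor (the values are arbitrary):

```python
import torch

t = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

clamped = t.clamp(min=0)   # negative entries are raised to 0
relu = torch.relu(t)       # built-in ReLU on the same values

print(clamped)
print(torch.equal(clamped, relu))
```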

In [15]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

###### Sorted!

That was easy, wasn't it? **However**, what if we wanted *other* activation functions (e.g. `tanh`, 
`sigmoid`, `LeakyReLU`)?

### Introducing the Functional API

PyTorch has functional counterparts of every `nn` module. 

By _functional_ here we mean "having no internal state", or, in other words, "whose output value is solely and fully determined by the values of its input arguments". 

Indeed, `torch.nn.functional` provides many of the same operations we find in `nn`, but with any parameters moved to arguments of the function call. 

For instance, the functional counterpart of `nn.Linear` is `nn.functional.linear`, which is a function that has signature `linear(input, weight, bias=None)`. 

The `weight` and `bias` parameters are **arguments** to the function.
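
The relationship between the two APIs can be sketched by feeding a module's own parameters to the functional version. The layer sizes and the seed below are arbitrary choices for this example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)   # module: owns weight and bias
x = torch.randn(2, 4)

# Module API: parameters are taken from the layer's internal state.
out_module = layer(x)

# Functional API: the very same parameters are passed in explicitly.
out_functional = F.linear(x, layer.weight, layer.bias)

# Written out by hand: x @ W^T + b, with W of shape (out_features, in_features).
out_manual = x @ layer.weight.t() + layer.bias

print(torch.allclose(out_module, out_functional))
print(torch.allclose(out_module, out_manual))
```

All three results agree; the only difference is who is responsible for holding `weight` and `bias`.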

Back to our `TwoLayerNet` model, it makes sense to keep using nn modules for `nn.Linear`, so that our model will be able to manage all of its `Parameter` instances during training. 

However, we can safely switch to the functional counterpart of `nn.ReLU`, since it has no parameters.

In [16]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = torch.nn.functional.relu(self.linear1(x))  # torch.relu would do as well
        y_pred = self.linear2(h_relu)
        return y_pred

In [None]:
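from torch.nn import functional as F

# With the conventional alias F, the line in forward() would read:
#     h_relu = F.relu(self.linear1(x))
F.relu(torch.randn(3))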

In [17]:
model = TwoLayerNet(D_in, H, D_out)
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    y_pred = model(x)
    loss = criterion(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 691.5662841796875
50 38.10184097290039
100 2.986912727355957
150 0.27371305227279663
200 0.0322413295507431
250 0.00451896945014596
300 0.000707552710082382
350 0.00011905599967576563
400 2.109917659254279e-05
450 3.876339633279713e-06


$\rightarrow$ For the curious minds: [The difference and connection between torch.nn and torch.nn.function from relu's various implementations](https://programmer.group/5d5a404b257d7.html)

#### Clever advice and Rule of thumb

> With **quantization**, stateless bits like activations suddenly become stateful because information on the quantization needs to be captured. This means that if we aim to quantize our model, it might be worthwhile to stick with the modular API if we go for non-JITed quantization. There is one style matter that will help you avoid surprises with (originally unforeseen) uses: if you need several applications of stateless modules (like `nn.Hardtanh` or `nn.ReLU`), it is likely a good idea to have a separate instance for each. Re-using the same module appears to be clever and will give correct results with our standard Python usage here, but tools analysing your model might trip over it.

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

### Custom Graph flow: Example of Weight Sharing

As we already discussed, defining a custom `nn.Module` in PyTorch requires declaring the layers (i.e. Parameters) in the constructor (`__init__`) and implementing the `forward` method, where the dynamic graph is defined by the calls to each of those layers/parameters.

As an example of **dynamic graphs** we are going to implement a scenario in which we require parameter (i.e. _weight_) sharing between layers.

In order to do so, we will implement a very odd model: a fully-connected ReLU network that on each `forward` call chooses a `random` number (between 1 and 4) and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

Concretely, we will implement _weight sharing_ among the innermost layers by simply reusing the same `Module` multiple times when defining the forward pass.
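
A minimal sketch of what reusing a `Module` means in practice, on a hypothetical toy model `SharedNet` (the name and sizes are invented here) that applies one `nn.Linear` twice:

```python
import torch


class SharedNet(torch.nn.Module):  # hypothetical toy model for illustration
    def __init__(self, h):
        super().__init__()
        self.shared = torch.nn.Linear(h, h)

    def forward(self, x):
        # The same Module (and therefore the same weights) is applied twice.
        return self.shared(torch.relu(self.shared(x)))


net = SharedNet(4)

# The shared layer is registered once: one weight (4x4) and one bias (4).
n_params = sum(p.numel() for p in net.parameters())

# After backward, the single weight tensor holds the gradient
# contributions summed over both applications.
net(torch.randn(3, 4)).sum().backward()
grad_shape = tuple(net.shared.weight.grad.shape)

print(n_params, grad_shape)
```

The model has 20 parameters regardless of how many times the layer is applied, and there is a single gradient tensor per shared parameter.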

In [24]:
import torch

In [42]:
import random


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = torch.relu(self.input_linear(x))
        hidden_layers = random.randint(0, 3)
        for _ in range(hidden_layers):
            h_relu = torch.relu(self.middle_linear(h_relu))
        y_pred = self.output_linear(h_relu)
        return y_pred

In [43]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

In [44]:
for t in range(500):
    for i in range(2):
        # Split the data into two mini-batches of N/2 samples each.
        start, end = (N // 2) * i, (N // 2) * (i + 1)
        # Slice into new names so the full x and y stay intact across iterations.
        x_batch = x[start:end, ...]
        y_batch = y[start:end, ...]
        # Forward pass: Compute predicted y by passing x to the model
        y_pred = model(x_batch)

        # Compute and print loss
        loss = criterion(y_pred, y_batch)
        if t % 50 == 0:
            print(t, loss.item())

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


### Latest from the `torch` ecosystem

* $\rightarrow$: [Migration from Chainer to PyTorch](https://medium.com/pytorch/migration-from-chainer-to-pytorch-8ed92c12c8)

* $\rightarrow$: [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/introduction_guide.html)
    - [fast.ai](https://docs.fast.ai/)

---
### References and Further Reading:

1. [Deep Learning with PyTorch, Luca Antiga et al.](https://www.manning.com/books/deep-learning-with-pytorch)
2. [(**Terrific**) PyTorch Examples Repo](https://github.com/jcjohnson/pytorch-examples) (*where most of the examples in this notebook have been adapted from*)