# PyTorch `nn` package

### The Learning Process of a Neural Network

![learning process sketch](./learning_process.png)

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

## Torch `autograd`

**TL;DR**:

(doc. reference: [torch autograd](https://pytorch.org/docs/stable/autograd.html))

`torch.autograd` provides classes and functions implementing automatic differentiation of arbitrary scalar-valued functions.

It requires minimal changes to existing code: you only need to declare the `Tensor`s for which gradients should be computed with the `requires_grad=True` keyword.
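As a minimal sketch of what that keyword does (the function $y = x^2 + 3x$ is only an illustrative choice, not something taken from the PyTorch docs):

```python
import torch

# Declare a tensor whose gradient we want autograd to track.
x = torch.tensor(2.0, requires_grad=True)

# Any computation built from x is recorded by autograd.
y = x ** 2 + 3 * x

# Populate x.grad with dy/dx = 2x + 3, evaluated at x = 2.
y.backward()
print(x.grad)  # tensor(7.)
```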

#### A quick look at the backend: Automatic Differentiation

**TL;DR**: Automatic Differentiation lets you compute **exact** derivatives (up to floating-point precision) at a cost that is a small **constant multiple** of evaluating the function itself

###### Teaser

Automatic Differentiation is the secret sauce that powers most of the existing Deep Learning frameworks (e.g. PyTorch or TensorFlow).

In a nutshell, Deep Learning frameworks provide the (technical) infrastructure in which computing the derivative of a function takes roughly as much time as evaluating the function. In particular, the design idea is: "you define a network with a loss function, and you get a gradient *for free*".


**Differentiation** in general is becoming a **first-class citizen** in programming languages, with early work started by Chris Lattner of the LLVM framework; see the [Differentiable Programming Manifesto](https://github.com/apple/swift/blob/master/docs/DifferentiableProgramming.md) for more detail.

However, if you're still wondering whether any of this is just boring math and how it relates to computer programming, I definitely suggest having a look at **this** video on YouTube:

[GOTO 2018 - Machine Learning: Alchemy for the Modern Computer Scientist by Erik Meijer](https://www.youtube.com/watch?v=Rs0uRQJdIcg)

###### Example

Rather than talking about large neural networks, we will seek to understand automatic differentiation via a small problem borrowed from the book of *Griewank and Walther (2008)*.

In the following we will adopt their very same **three-part** notation (also used in [4]).

A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is constructed using intermediate variables $v_i$ such that:

- variables $v_{i-n} = x_i$, $i = 1,\ldots,n$ are the input variables;
- variables $v_i$, $i = 1,\ldots,l$ are the working **intermediate** variables;
- variables $y_{m-i} = v_{l-i}$, $i = m-1,\ldots,0$ are the output variables.

<img src="ad_example.png" class="maxw80" />

The **direction** in which the graph is **traversed**, and hence in which derivatives are propagated, defines the two modes of AD:

* **forward mode** AD;
* **backward mode** AD.

###### Forward (Tangent) Mode

The idea of forward mode automatic differentiation is that we can compute derivatives as we go and
that the chain rule says the overall derivative that we want is a composition of these incremental
computations.

Let’s imagine that our overall goal is to compute $\frac{\partial y}{\partial x_1}$ for the example above.

We denote all the intermediate *partial derivatives* (**with respect to $x_1$**) as:
$$
\dot{v}_{i} = \frac{\partial v_i}{\partial x_1}
$$

The **very same** applies when we want to compute $\frac{\partial y}{\partial x_2}$

<img src="forward_ad.png" class="maxw80" />

The interesting bit is that we can implement this bookkeeping during the *execution trace* just via abstraction.

We can replace our floating-point numbers with `(value, derivative)` tuples, and replace the primitive functions with the following Python implementations (using only `numpy`):

In [1]:
import numpy as np

def add(atuple, btuple):
    (a, adot) = atuple
    (b, bdot) = btuple
    return ( a + b, adot + bdot)

def subtract(atuple, btuple): 
    (a, adot) = atuple
    (b, bdot) = btuple
    return (a - b, adot - bdot)

def multiply(atuple, btuple):
    (a, adot) = atuple
    (b, bdot) = btuple
    return (a * b, adot * b + bdot * a)

def divide(atuple, btuple):
    (a, adot) = atuple
    (b, bdot) = btuple
    return (a / b, (adot * b - bdot * a) / (b*b))

def ln(atuple):
    (a, adot) = atuple
    return (np.log(a), (1/a)*adot)

def sin(atuple):
    (a, adot) = atuple
    return (np.sin(a), np.cos(a)*adot)

In [2]:
def f(x1: tuple, x2: tuple):
    # ln(x1) + x1x2 - sin(x2)
    v1 = ln(x1)
    v2 = multiply(x1, x2)
    v3 = add(v1, v2)
    v4 = sin(x2)
    v5 = subtract(v3, v4)
    return v5

In [3]:
f(x1=(2, 1), x2=(5, 0))

(11.652071455223084, 5.5)

In [4]:
f(x1=(2, 0), x2=(5, 1))

(11.652071455223084, 1.7163378145367738)
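As a cross-check, the same two partial derivatives can be obtained from PyTorch's `autograd`. This is only a sketch that re-expresses `f` with `torch` primitives; note that a single `backward()` call yields both partials, whereas the tuple-based version above needed one evaluation per input.

```python
import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

# Same function as f above: ln(x1) + x1*x2 - sin(x2)
y = torch.log(x1) + x1 * x2 - torch.sin(x2)

# One backward call fills in both partial derivatives.
y.backward()
print(x1.grad.item(), x2.grad.item())
```

The values agree with the forward-mode results (5.5 and roughly 1.7163) up to single-precision rounding.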

###### Reverse (Co-Tangent) Mode

AD in the reverse accumulation mode corresponds to a generalized backpropagation algorithm, in that it propagates derivatives backward from a given output. This is done by complementing each intermediate variable $v_i$ with an **adjoint**:
$$
\bar{v}_{i} = \frac{\partial y}{\partial v_i} = \sum_{j:\, j \text{ child of } i} \bar{v}_j \frac{\partial v_j}{\partial v_i}
$$
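The adjoint recursion above can be traced by hand for the running example $y = \ln(x_1) + x_1 x_2 - \sin(x_2)$. The sketch below is a manual unrolling (the `*_bar` variable names are our own convention for the adjoints), not a general implementation:

```python
import math

x1, x2 = 2.0, 5.0

# Forward sweep: evaluate and store every intermediate variable.
v1 = math.log(x1)
v2 = x1 * x2
v3 = v1 + v2
v4 = math.sin(x2)
v5 = v3 - v4          # y

# Reverse sweep: propagate adjoints from the output back to the inputs.
v5_bar = 1.0                      # dy/dy
v3_bar = v5_bar * 1.0             # v5 = v3 - v4
v4_bar = v5_bar * -1.0
v1_bar = v3_bar * 1.0             # v3 = v1 + v2
v2_bar = v3_bar * 1.0
# x1 feeds both v1 and v2; x2 feeds both v2 and v4: sum over children.
x1_bar = v1_bar * (1.0 / x1) + v2_bar * x2
x2_bar = v2_bar * x1 + v4_bar * math.cos(x2)

print(x1_bar, x2_bar)
```

A single reverse sweep produces both `x1_bar` and `x2_bar`, matching the two forward-mode runs shown earlier.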

<img src="backward_ad.png" class="maxw85" />

<img src="https://github.com/google/tangent/raw/master/docs/toolspace.png" />

There are various ways to implement this abstraction in its full generality, but an implementation requires more code than can easily appear here. The three major approaches are:

**source code transformation**: The adjoint backward pass code is generated a priori from the forward computation. A clean Python example of such a system is [**Tangent**](https://colab.research.google.com/drive/1cjoX9GteBymbnqcikNMZP1uenMcwAGDe).

**graph-based**: This approach uses an embedded mini-language to specify a graph of computations that can then be manipulated for function evaluations and gradients. 

$\rightarrow$ The advantage of this approach is that it is amenable to intelligent graph optimizations and use of compilers. The embedded mini-language also makes it possible to build specialized hardware that targets the differentiable primitives. 

$\rightarrow$ The downside of this approach is that you are not coding in the host language (e.g., Python) and so you can’t take advantage of its imperative design and control flow. Generally the mini-language is less expressive than the host language. Also, the lazy execution of the function represented by the graph can make it difficult to debug. TensorFlow 1.x is an example of this kind of automatic differentiation.

**tape-based**: This approach tracks the actual composed functions as they are called during execution of the forward pass. One name for this data structure is the *Wengert list*. 

$\rightarrow$ With the ordered sequence of computations in hand, it is then possible to walk backward through the list to compute the gradient. 

$\rightarrow$ The advantage of this is that it can more easily use all the features of the host language and the imperative execution is easier to understand. 

$\rightarrow$ The downside is that it can be more difficult to optimize the code and reuse computations across executions. 

[Autograd](https://github.com/HIPS/autograd) is an example of this. 
The automatic differentiation in [PyTorch](https://pytorch.org/) also roughly follows this model.
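To make the tape-based idea concrete, here is a deliberately tiny sketch in plain Python. The `Var` class, the global `tape` list and the helper functions are all hypothetical names introduced for illustration; real systems such as Autograd or PyTorch are far more elaborate.

```python
import math

tape = []  # the Wengert list: operations in the order they were executed


class Var:
    """A scalar that remembers its parents and the local derivatives."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent Var, d(self)/d(parent))
        self.grad = 0.0


def record(value, parents):
    out = Var(value, parents)
    tape.append(out)
    return out


def add(a, b): return record(a.value + b.value, ((a, 1.0), (b, 1.0)))
def sub(a, b): return record(a.value - b.value, ((a, 1.0), (b, -1.0)))
def mul(a, b): return record(a.value * b.value, ((a, b.value), (b, a.value)))
def log(a): return record(math.log(a.value), ((a, 1.0 / a.value),))
def sin(a): return record(math.sin(a.value), ((a, math.cos(a.value)),))


def backward(output):
    # Walk the tape in reverse, accumulating adjoints into the parents.
    output.grad = 1.0
    for node in reversed(tape):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


x1, x2 = Var(2.0), Var(5.0)
y = sub(add(log(x1), mul(x1, x2)), sin(x2))
backward(y)
print(y.value, x1.grad, x2.grad)
```

Because the forward pass is ordinary Python, loops and conditionals in the host language work unchanged; the price is that the tape has to be rebuilt on every execution.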

###### References and Further Reading

1. [Deep Learning with PyTorch (**free sample**) - Luca Antiga et al.](https://pytorch.org/deep-learning-with-pytorch)

4. [(*Paper*) Automatic Differentiation in Machine Learning: a Survey](https://arxiv.org/abs/1502.05767)

---

### `torch.nn` in a Nutshell

Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however for large neural networks raw autograd can be a bit too low-level.

When building neural networks we frequently think of arranging the computation into layers, some of which 
have learnable parameters which will be optimized during learning.

In TensorFlow, packages like **Keras** (and the older **TensorFlow-Slim** and **TFLearn**) provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the `nn` package serves this same purpose. 

The `nn` package defines a set of `Module`s, which are roughly equivalent to neural network layers. 

A `Module` receives input `Tensor`s and computes output `Tensor`s, but may also hold internal state such as `Tensor`s containing learnable parameters. 

The `nn` package also defines a set of useful `loss` functions that are commonly used when 
training neural networks.
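A small sketch of these two ideas, using a single `Linear` module and the `MSELoss` loss (the sizes 3, 2 and 4 are arbitrary choices for illustration):

```python
import torch

# A Module that holds its own learnable state (weight and bias).
layer = torch.nn.Linear(3, 2)
print(layer.weight.shape, layer.bias.shape)  # torch.Size([2, 3]) torch.Size([2])

# It maps input Tensors to output Tensors.
x = torch.randn(4, 3)
out = layer(x)
print(out.shape)  # torch.Size([4, 2])

# Loss functions are Modules too; they return a scalar Tensor.
loss_fn = torch.nn.MSELoss()
loss = loss_fn(out, torch.zeros(4, 2))
print(loss.dim())  # 0
```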

##### PyTorch Examples

The following examples have been extracted from the [PyTorch Examples Repository](https://github.com/jcjohnson/pytorch-examples) by `@jcjohnson`

In this example we use the `nn` package to implement our two-layer network:

In [5]:
import torch

In [6]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

In [7]:
# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4

In [8]:
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

0 642.0931396484375
50 39.40785217285156
100 2.866370439529419
150 0.3431205451488495
200 0.05561559647321701
250 0.010967700742185116
300 0.002458262722939253
350 0.0005952278152108192
400 0.00015158375026658177
450 3.962670962209813e-05


### `torch.optim`

Up to this point we have updated the weights of our models by manually mutating the Tensors holding learnable parameters (**using `torch.no_grad()` or `.data` to avoid tracking history in autograd**). 

This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks using more sophisticated optimizers like `AdaGrad`, `RMSProp`, 
or `Adam`.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
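To see what the abstraction buys us, the sketch below checks that one `step()` of plain `SGD` performs the same update we previously wrote by hand. The toy loss $\sum_i w_i^2$ and the learning rate are arbitrary illustrative choices.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()
loss.backward()          # w.grad = 2 * w = [2, 4]

expected = w.detach() - 0.1 * w.grad   # the manual update rule
optimizer.step()                       # the optimizer applies it for us

print(w.detach(), expected)
```

Switching to another algorithm then only means constructing a different optimizer; the training loop itself does not change.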

Let's finally modify the previous example in order to use `torch.optim` and the `Adam` algorithm:

##### Model and Optimiser (w/ Parameters) at a glance

![model_and_optimiser](model_optim.png)

<span class="fn"><i>Source:</i> [1] - _Deep Learning with PyTorch_ </span>

In [9]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

In [10]:
# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

In [11]:
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

0 702.6163940429688
50 219.6895751953125
100 61.97289276123047
150 11.652957916259766
200 1.4025999307632446
250 0.1275288462638855
300 0.009245814755558968
350 0.0005796181503683329
400 3.2619711419101804e-05
450 1.6339765807060758e-06
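The reason gradients must be zeroed at every iteration can be seen in isolation. In this sketch the function $2w$ is an arbitrary choice, and the gradient is reset by hand with `w.grad.zero_()` rather than through an optimizer:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(2 * w).backward()
after_first = w.grad.item()    # 2.0

(2 * w).backward()
after_second = w.grad.item()   # 4.0: the new gradient was added to the old one

w.grad.zero_()                 # manual reset
(2 * w).backward()
after_reset = w.grad.item()    # 2.0 again

print(after_first, after_second, after_reset)
```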


### Can we do better?

---

Possible scenarios:

- Specify models that are more complex than a sequence of existing (pre-defined) modules;
- Customise the learning procedure (e.g. _weight sharing_ ?)
- ?

For these cases, **PyTorch** allows us to define our own custom modules by subclassing `nn.Module` and defining a `forward` method which receives the input data (i.e. `Tensor`) and returns the output (i.e. `Tensor`).

It is in the `forward` method that **all** the _magic_ of Dynamic Graph and `autograd` operations happens!

### PyTorch: Custom Modules

Let's implement our **two-layer** model as a custom `nn.Module` subclass:

In [12]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.hidden_activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        l1 = self.linear1(x)
        h_relu = self.hidden_activation(l1)
        y_pred = self.linear2(h_relu)
        return y_pred

In [13]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [14]:
# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

In [15]:
# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

In [16]:
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 716.993896484375
50 28.708864212036133
100 1.3810720443725586
150 0.1258920580148697
200 0.016454389318823814
250 0.0026553189381957054
300 0.0004969889996573329
350 0.00010043554357253015
400 2.1219193513388745e-05
450 4.619778337655589e-06


#### What happened really? Let's have a closer look

```python
>>> model = TwoLayerNet(D_in, H, D_out)
```

This calls the `TwoLayerNet.__init__` **constructor** method (_implementation reported below_):

```python
def __init__(self, D_in, H, D_out):
    """
    In the constructor we instantiate two nn.Linear modules and assign them as
    member variables.
    """
    super(TwoLayerNet, self).__init__()
    self.linear1 = torch.nn.Linear(D_in, H)
    self.hidden_activation = torch.nn.ReLU()
    self.linear2 = torch.nn.Linear(H, D_out)
```

1. First, we call the `nn.Module` constructor, which sets up the housekeeping
    - If you forget to do that, you will get an error message reminding you to call it before using any `nn.Module` capabilities
2. We create an instance attribute for each layer (operation or tensor) that we intend to include in our model
    - These can also be `Sequential` containers, i.e. _submodules_ or *blocks of layers*
    - **Note**: We are **not** defining the graph yet, just the layers!
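The reminder mentioned in point 1 can be reproduced with a deliberately faulty class (`Broken` is a hypothetical name used only for this sketch):

```python
import torch


class Broken(torch.nn.Module):
    def __init__(self):
        # Deliberately missing: super().__init__()
        self.linear = torch.nn.Linear(2, 1)


try:
    Broken()
    message = None
except AttributeError as err:
    message = str(err)

print(message)
```

Assigning a submodule relies on bookkeeping that only exists once `nn.Module.__init__` has run, which is why the assignment fails.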

```python
>>> y_pred = model(x)
```

1. First thing to notice: the `model` object is **callable**
   - It means `nn.Module` is implementing a `__call__` method
   - We **don't** need to re-implement that!
   
2. (in fact) The `nn.Module` class will call `self.forward` - in a [Template Method Pattern](https://en.wikipedia.org/wiki/Template_method_pattern) fashion
    - for this reason, we have to define the `forward` method
    - (needless to say) the `forward` method implements the **forward** pass of our model

From `torch/nn/modules/module.py` (excerpt from an older PyTorch release; the current implementation differs in detail):

```python 
class Module(object):
    # [...] omitted
    def __call__(self, *input, **kwargs):
        for hook in self._forward_pre_hooks.values():
            result = hook(self, input)
            if result is not None:
                if not isinstance(result, tuple):
                    result = (result,)
                input = result
        if torch._C._get_tracing_state():
            result = self._slow_forward(*input, **kwargs)
        else:
            result = self.forward(*input, **kwargs)
        for hook in self._forward_hooks.values():
            hook_result = hook(self, input, result)
            if hook_result is not None:
                result = hook_result
        if len(self._backward_hooks) > 0:
            var = result
            while not isinstance(var, torch.Tensor):
                if isinstance(var, dict):
                    var = next((v for v in var.values() if isinstance(v, torch.Tensor)))
                else:
                    var = var[0]
            grad_fn = var.grad_fn
            if grad_fn is not None:
                for hook in self._backward_hooks.values():
                    wrapper = functools.partial(hook, self)
                    functools.update_wrapper(wrapper, hook)
                    grad_fn.register_hook(wrapper)
        return result
    
    # [...] omitted
    def forward(self, *input):
        r"""Defines the computation performed at every call.

        Should be overridden by all subclasses.

        .. note::
            Although the recipe for forward pass needs to be defined within
            this function, one should call the :class:`Module` instance afterwards
            instead of this since the former takes care of running the
            registered hooks while the latter silently ignores them.
        """
        raise NotImplementedError
```

**Take away messages** :
1. We don't need to implement the `__call__` method at all in our custom model subclass
2. We don't need to call the `forward` method directly. 
    - We could, but we would miss the flexibility of _forward_ and _backward_ hooks 
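The second point can be checked with a forward hook. In this sketch `log_output_shape` and `seen_shapes` are names made up for the example; `register_forward_hook` is the standard `nn.Module` method.

```python
import torch

layer = torch.nn.Linear(2, 1)
seen_shapes = []


def log_output_shape(module, inputs, output):
    # Forward hooks receive the module, its positional inputs, and its output.
    seen_shapes.append(tuple(output.shape))


handle = layer.register_forward_hook(log_output_shape)
x = torch.randn(3, 2)

layer(x)          # goes through __call__, so the hook runs
layer.forward(x)  # bypasses __call__, so the hook is skipped
handle.remove()

print(seen_shapes)  # [(3, 1)]
```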

##### Last but not least

```python
>>> optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

Since `model` is an instance of an `nn.Module` subclass, `model.parameters()` automatically collects all the `Parameter`s registered by its submodules (here, the weights and biases of the two `nn.Linear` layers). These are the tensors for which `autograd` computes gradients during the *backward* pass and which the optimiser then updates.

###### `model.named_parameters`

In [17]:
for name_str, param in model.named_parameters():
    print("{:21} {:19} {}".format(name_str, str(param.shape), param.numel()))

linear1.weight        torch.Size([100, 1000]) 100000
linear1.bias          torch.Size([100])   100
linear2.weight        torch.Size([10, 100]) 1000
linear2.bias          torch.Size([10])    10


**WAIT**: What happened to `hidden_activation` ?

```python
self.hidden_activation = torch.nn.ReLU()
```

So, it looks like we are registering in the constructor a submodule (`torch.nn.ReLU`) that has no parameters.

Generalising, if we had **more** (hidden) layers, we would need to define one of these submodules for each pair of layers (at least).
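A quick way to confirm this is to inspect the module directly. The sketch below uses an equivalent `Sequential` model so that it is self-contained:

```python
import torch

# ReLU is a Module, but it owns no learnable parameters.
relu = torch.nn.ReLU()
n_relu_params = len(list(relu.parameters()))

model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)

# It is still registered as a child of the model...
child_types = [type(m).__name__ for m in model.children()]
# ...yet it contributes nothing to the parameter count.
n_total = sum(p.numel() for p in model.parameters())

print(n_relu_params, child_types, n_total)
```

The total, 101110, is exactly the sum of the four entries printed by `named_parameters` above.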

Looking back at the implementation of the `TwoLayerNet` class as a whole, it looks like a bit of a waste.

**Can we do any better here?** 🤔

---

Well, in this particular case, we could implement the `ReLU` activation _manually_; it is not that difficult, is it?

$\rightarrow$ As we already did before, we could use the [`torch.clamp`](https://pytorch.org/docs/stable/torch.html?highlight=clamp#torch.clamp) function

> `torch.clamp`: Clamp all elements in input into the range [ min, max ] and return a resulting tensor

`t.clamp(min=0)` is **exactly** the ReLU that we want.
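A one-line check of that claim on a handful of hand-picked values:

```python
import torch

t = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

clamped = t.clamp(min=0)
same_as_relu = torch.equal(clamped, torch.relu(t))

print(clamped, same_as_relu)
```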

In [18]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

###### Sorted!

That was easy, wasn't it? **However**, what if we wanted *other* activation functions (e.g. `tanh`, 
`sigmoid`, `LeakyReLU`)?

##### Introducing the Functional API

PyTorch has functional counterparts for most `nn` modules. 

By _functional_ here we mean "having no internal state", or, in other words, "whose output value is solely and fully determined by the values of its input arguments". 

Indeed, `torch.nn.functional` provides many of the same operations we find in `nn`, but with any parameters passed as arguments to the function call. 

For instance, the functional counterpart of `nn.Linear` is `nn.functional.linear`, which is a function that has signature `linear(input, weight, bias=None)`. 

The `weight` and `bias` parameters are **arguments** to the function.
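A short sketch of the correspondence: feeding a layer's own `weight` and `bias` to the functional version reproduces the module's output (the sizes are arbitrary).

```python
import torch
import torch.nn.functional as F

layer = torch.nn.Linear(3, 2)
x = torch.randn(4, 3)

# Stateful version: the module looks up its own parameters.
out_module = layer(x)

# Stateless version: the same parameters are passed in explicitly.
out_functional = F.linear(x, layer.weight, layer.bias)

print(torch.allclose(out_module, out_functional))
```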

Back to our `TwoLayerNet` model, it makes sense to keep using nn modules for `nn.Linear`, so that our model will be able to manage all of its `Parameter` instances during training. 

However, we can safely switch to the functional counterpart of `nn.ReLU`, since it has no parameters.

In [19]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = torch.nn.functional.relu(self.linear1(x))  # torch.relu would do as well
        y_pred = self.linear2(h_relu)
        return y_pred

In [20]:
model = TwoLayerNet(D_in, H, D_out)
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    y_pred = model(x)
    loss = criterion(y_pred, y)
    if t % 50 == 0:
        print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 698.4146728515625
50 34.69792556762695
100 2.164041042327881
150 0.23706716299057007
200 0.03495536744594574
250 0.006048018112778664
300 0.001142494729720056
350 0.00022856226132716984
400 4.769738734466955e-05
450 1.0313570783182513e-05


$\rightarrow$ For the curious minds: [The difference and connection between torch.nn and torch.nn.function from relu's various implementations](https://programmer.group/5d5a404b257d7.html)

###### Latest from the `torch` ecosystem

* $\rightarrow$: [Migration from Chainer to PyTorch](https://medium.com/pytorch/migration-from-chainer-to-pytorch-8ed92c12c8)

* $\rightarrow$: [PyTorch Lightning](https://pytorch-lightning.readthedocs.io/en/latest/introduction_guide.html)
    - [fast.ai](https://docs.fast.ai/)

### (Collaborative)

_Implementing the Perceptron_ in PyTorch

<img src="https://raw.githubusercontent.com/rasbt/python-machine-learning-book-3rd-edition/master/ch02/images/02_09.png" />

Perceptron (Adaline algorithm) - **ADAptive LInear Neuron** main features:

- Unit Step function
- Binary Classification 
- Linear Model
- Weight Update rule with `learning rate` and Gradient Descent Optimisation

$
    {\bf w} = {\bf w} + \Delta {\bf w} \text{, where } \Delta {\bf w} = -\eta \nabla J({\bf w}) 
$ [$^2$](#fn2)

<span id="fn2">2: ${\bf w}$ (in bold) refers to the **weights vector**. </span>
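One update step of this rule can be sketched with `autograd`. The cost $J({\bf w}) = \frac{1}{2}\sum_i (y_i - {\bf w}^T {\bf x}_i)^2$ is an assumption here (the usual Adaline sum of squared errors), and the data and learning rate are made up for the example.

```python
import torch

X = torch.tensor([[1.0, 2.0], [3.0, 4.0]])   # two samples, two features
y = torch.tensor([1.0, 0.0])                 # targets
w = torch.zeros(2, requires_grad=True)
eta = 0.01                                   # learning rate

# Assumed cost: half the sum of squared errors of the linear model.
J = 0.5 * ((y - X @ w) ** 2).sum()
J.backward()                                 # w.grad = grad J(w) = [-1, -2]

# w = w + delta_w, with delta_w = -eta * grad J(w)
with torch.no_grad():
    w += -eta * w.grad

print(w)
```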

In [21]:
import torch
import torch.nn as nn

class Perceptron(nn.Module):
    """ A perceptron is one linear layer """
    def __init__(self, input_dim):
        """ Args:
        input_dim (int): size of the input features
        """
        super(Perceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim, 1)

    def forward(self, x_in):
        """The forward pass of the perceptron

        Args:
            x_in (torch.Tensor): an input data tensor
                x_in.shape should be (batch, num_features)
        Returns:
            the resulting tensor. tensor.shape should be (batch,).
        """
        # squeeze(-1) drops only the trailing dimension: (batch, 1) -> (batch,)
        return torch.sigmoid(self.fc1(x_in)).squeeze(-1)

### A review of the main `activation` functions

###### Sigmoid

$$
f(x) = \frac{1}{1 + e^{-x}}
$$

In [None]:
import torch
import matplotlib.pyplot as plt
x = torch.arange(-5., 5., 0.1) 

y = torch.sigmoid(x) 

plt.plot(x.numpy(), y.numpy()) 
plt.show()

###### Tanh

$$
f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

In [None]:
import torch
import matplotlib.pyplot as plt
x = torch.arange(-5., 5., 0.1) 

y = torch.tanh(x) 

plt.plot(x.numpy(), y.numpy()) 
plt.show()

###### ReLU

$$
f(x) = \max(0, x)
$$

In [None]:
import torch
import matplotlib.pyplot as plt
x = torch.arange(-5., 5., 0.1) 

y = torch.nn.functional.relu(x) 

plt.plot(x.numpy(), y.numpy()) 
plt.show()

###### PReLU

$$
f(x) = \max(x, \alpha x)
$$

In [None]:
import torch
import matplotlib.pyplot as plt

prelu = torch.nn.PReLU(num_parameters=1)
x = torch.arange(-5., 5., 0.1) 

y = prelu(x) 

plt.plot(x.detach().numpy(), y.detach().numpy()) 
plt.show()

###### Softmax

$$
f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

In [None]:
import torch.nn as nn 
import torch

softmax = nn.Softmax(dim=1) 

x_input = torch.randn(1, 3) 
y_output = softmax(x_input) 

print(x_input)
print(y_output) 

print(torch.sum(y_output, dim=1))

###### Loss Functions

```python
CrossEntropyLoss == LogSoftmax + NLLLoss
BCEWithLogitsLoss == Sigmoid + BCELoss
MSELoss(reduction='sum') == SSE
```
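These equivalences can be verified numerically. The sketch below uses random inputs with a fixed seed; the shapes and targets are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Multi-class: CrossEntropyLoss vs LogSoftmax followed by NLLLoss
logits = torch.randn(4, 3)
target = torch.tensor([0, 2, 1, 1])
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)

# Binary: BCEWithLogitsLoss vs Sigmoid followed by BCELoss
binary_logits = torch.randn(4)
binary_target = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce_with_logits = nn.BCEWithLogitsLoss()(binary_logits, binary_target)
bce = nn.BCELoss()(torch.sigmoid(binary_logits), binary_target)

# Regression: MSELoss with sum reduction vs sum of squared errors
pred, true = torch.randn(4, 3), torch.randn(4, 3)
mse_sum = nn.MSELoss(reduction='sum')(pred, true)
sse = ((pred - true) ** 2).sum()

print(torch.allclose(ce, nll),
      torch.allclose(bce_with_logits, bce),
      torch.allclose(mse_sum, sse))
```

The fused versions (`CrossEntropyLoss`, `BCEWithLogitsLoss`) are generally preferred in practice because they are numerically more stable than chaining the two steps by hand.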

---
### References and Further Reading:

1. [Deep Learning with PyTorch, Luca Antiga et al.](https://www.manning.com/books/deep-learning-with-pytorch)
2. [(**Terrific**) PyTorch Examples Repo](https://github.com/jcjohnson/pytorch-examples) (*where most of the examples in this notebook have been adapted from*)