## Deep Neural Networks with PyTorch

In [None]:
import torch
from torch import nn
from torch.nn import functional as F

In [10]:
X = torch.rand(2, 20)

### 1. Layers and Modules

In [11]:
class MLP(nn.Module):
    def __init__(self):
        # Call the constructor of the parent class nn.Module to perform
        # the necessary initialization
        super().__init__()
        self.hidden = nn.LazyLinear(256)
        self.out = nn.LazyLinear(10)
        
    # Define the forward propagation of the model, that is, how to return the
    # required model output based on the input X
    def forward(self, X):
        return self.out(F.relu(self.hidden(X)))
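
For comparison, the same two-layer architecture can be sketched with `nn.Sequential`. This is an illustrative equivalent rather than part of the original notebook; the name `seq_mlp` is hypothetical.

```python
import torch
from torch import nn

# Hypothetical name: seq_mlp mirrors the MLP class above
# (hidden layer of 256 units, ReLU, output layer of 10 units).
seq_mlp = nn.Sequential(nn.LazyLinear(256), nn.ReLU(), nn.LazyLinear(10))

X = torch.rand(2, 20)
out = seq_mlp(X)  # the first call infers the input sizes of the lazy layers
print(out.shape)
print(seq_mlp[0].weight.shape)
```

Writing a custom `nn.Module` subclass becomes worthwhile once the forward pass needs more than a straight chain of layers.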

#### How to integrate arbitrary code into the flow of your neural network computations.

In [12]:
class FixedHiddenMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Random weights stored as a plain tensor (not an nn.Parameter), so no
        # gradients are computed for them and they stay constant during training
        self.rand_weight = torch.rand((20, 20))
        self.linear = nn.LazyLinear(20)
        
    def forward(self, X):
        X = self.linear(X)
        X = F.relu(X @ self.rand_weight + 1)
        # Reuse the fully connected layer. This is equivalent to two fully
        # connected layers that share the same parameters
        X = self.linear(X)
        # Control flow
        while X.abs().sum() > 1:
            X /= 2
        return X.sum()
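
One way to confirm that a plain tensor attribute stays out of training is to list the module's registered parameters. The sketch below uses a cut-down, hypothetical module (`TinyFixed`) with explicit `nn.Linear` sizes rather than the class above.

```python
import torch
from torch import nn

class TinyFixed(nn.Module):  # hypothetical, cut-down version of FixedHiddenMLP
    def __init__(self):
        super().__init__()
        # Plain tensor attribute: not wrapped in nn.Parameter
        self.rand_weight = torch.rand((4, 4))
        self.linear = nn.Linear(4, 4)

    def forward(self, X):
        return self.linear(X) @ self.rand_weight

net = TinyFixed()
# Only the weight and bias of the linear layer are registered as parameters
param_names = [name for name, _ in net.named_parameters()]
print(param_names)
print(net.rand_weight.requires_grad)
```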

#### Nest modules in some creative ways

In [13]:
class NestMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.LazyLinear(64), nn.ReLU(),
                                 nn.LazyLinear(32), nn.ReLU())
        self.linear = nn.LazyLinear(16)
        
    def forward(self, X):
        return self.linear(self.net(X))


chimera = nn.Sequential(NestMLP(), nn.LazyLinear(20), FixedHiddenMLP())
chimera(X)

tensor(0.2701, grad_fn=<SumBackward0>)

### 2. Parameter Management

In [14]:
net = nn.Sequential(nn.LazyLinear(8),
                    nn.ReLU(),
                    nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

#### Parameter Access

We can inspect the parameters of the second fully connected layer as follows.


In [17]:
net[2].state_dict()

OrderedDict([('weight',
              tensor([[ 0.3216, -0.0429,  0.2595, -0.2796,  0.0971, -0.0296, -0.3129, -0.1053]])),
             ('bias', tensor([-0.2723]))])

In [18]:
net[2].bias.data

tensor([-0.2723])

In [19]:
[(name, param.shape) for name, param in net.named_parameters()]

[('0.weight', torch.Size([8, 4])),
 ('0.bias', torch.Size([8])),
 ('2.weight', torch.Size([1, 8])),
 ('2.bias', torch.Size([1]))]
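
As a small follow-up sketch, the same shapes can be used to count the trainable parameters and to check that no gradient is stored before any backward pass. Explicit `nn.Linear` sizes are assumed here so the block runs on its own.

```python
import torch
from torch import nn

# Same layout as the network above, with explicit input sizes
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))

# Total number of scalar parameters: 8*4 + 8 + 1*8 + 1
num_params = sum(p.numel() for p in net.parameters())
print(num_params)

# Each parameter is an nn.Parameter; its .grad stays None until backward() runs
print(type(net[2].bias))
print(net[2].weight.grad is None)
```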

#### Tied Parameters

In [20]:
# We need to give the shared layer a name so that we can refer to its
# parameters
shared = nn.LazyLinear(8)
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(),
                    shared, nn.ReLU(),
                    shared, nn.ReLU(),
                    nn.LazyLinear(1))
net(X)
# Check whether the parameters are the same
print(net[2].weight.data[0] == net[4].weight.data[0])
net[2].weight.data[0, 0] = 100
# Make sure that they are actually the same object rather than just having the
# same value
print(net[2].weight.data[0] == net[4].weight.data[0])

tensor([True, True, True, True, True, True, True, True])
tensor([True, True, True, True, True, True, True, True])


When parameters are tied, what happens to the gradients? Because the second and third hidden layers use the same parameter tensors, the gradients from both uses are added together during backpropagation.
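
The gradient accumulation can be checked numerically. The sketch below (layer sizes and variable names are illustrative assumptions) applies one layer twice, then repeats the computation with two untied copies holding identical values, and compares the gradients.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.rand(2, 4)

# Tied: the same layer is applied twice
shared = nn.Linear(4, 4)
shared(torch.relu(shared(x))).sum().backward()
tied_grad = shared.weight.grad.clone()

# Untied: two separate layers that start from the same values
first, second = nn.Linear(4, 4), nn.Linear(4, 4)
first.load_state_dict(shared.state_dict())
second.load_state_dict(shared.state_dict())
second(torch.relu(first(x))).sum().backward()

# The tied gradient equals the sum of the two untied gradients
grads_match = torch.allclose(tied_grad, first.weight.grad + second.weight.grad)
print(grads_match)
```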

### 3. Parameter Initialization

- By default, PyTorch initializes weights and biases **uniformly**, drawing from a range computed from the layer's dimensions.

- **PyTorch’s nn.init module** provides a variety of preset initialization methods.
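
As a rough check of the default behaviour, the sketch below assumes that `nn.Linear` in recent PyTorch releases draws both weights and biases from $U(-1/\sqrt{n_{in}}, 1/\sqrt{n_{in}})$. That bound is an assumption about the installed version, not something stated in these notes.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
fan_in, fan_out = 100, 50
layer = nn.Linear(fan_in, fan_out)

# Assumed default bound for nn.Linear: 1 / sqrt(fan_in)
bound = 1.0 / math.sqrt(fan_in)
max_w = layer.weight.abs().max().item()
max_b = layer.bias.abs().max().item()
print(bound, max_w, max_b)
```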

In [21]:
import torch
from torch import nn

In [22]:
net = nn.Sequential(nn.LazyLinear(8), nn.ReLU(), nn.LazyLinear(1))
X = torch.rand(size=(2, 4))
net(X).shape

torch.Size([2, 1])

- The code below initializes all weight parameters as Gaussian random variables with standard deviation 0.01, while bias parameters are cleared to zero.

In [23]:
def init_normal(module):
    if type(module) == nn.Linear:
        nn.init.normal_(module.weight, mean=0, std=0.01)
        nn.init.zeros_(module.bias)
        
net.apply(init_normal)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([ 0.0049, -0.0119, -0.0006,  0.0060]), tensor(0.))

- We can also initialize all the weights to a given constant value (say, 1), again setting the biases to zero.

In [24]:
def init_constant(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 1)
        nn.init.zeros_(module.bias)
        
net.apply(init_constant)
net[0].weight.data[0], net[0].bias.data[0]

(tensor([1., 1., 1., 1.]), tensor(0.))

#### Built-in Initialization

#### Xavier Initialization

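- Xavier initialization draws weights from a distribution with **zero mean** and **variance** $\sigma^{2} = \frac{2}{n_{in}+n_{out}}$. The Gaussian variant samples from $N(0, \sigma^{2})$; the uniform variant used in the code below (`nn.init.xavier_uniform_`) samples from $U(-a, a)$ with $a = \sqrt{\frac{6}{n_{in}+n_{out}}}$, which has the same variance.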

- Below we initialize the first layer with the Xavier initializer and initialize the second layer to a constant value of 42.

In [26]:
def init_xavier(module):
    if type(module) == nn.Linear:
        nn.init.xavier_uniform_(module.weight)
        
def init_42(module):
    if type(module) == nn.Linear:
        nn.init.constant_(module.weight, 42)
        
net[0].apply(init_xavier)
net[2].apply(init_42)
print(net[0].weight.data[0])
print(net[2].weight.data)

tensor([-0.2814, -0.1015, -0.5769,  0.4441])
tensor([[42., 42., 42., 42., 42., 42., 42., 42.]])
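
To see that the uniform variant matches the stated variance, one can initialize a larger layer and compare the empirical variance with $\frac{2}{n_{in}+n_{out}}$. The layer sizes below are arbitrary choices for illustration.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
fan_in, fan_out = 256, 128
layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(layer.weight)

# Uniform bound a = sqrt(6 / (n_in + n_out)); variance of U(-a, a) is a^2 / 3
bound = math.sqrt(6.0 / (fan_in + fan_out))
target_var = 2.0 / (fan_in + fan_out)
empirical_var = layer.weight.var().item()
max_abs = layer.weight.abs().max().item()
print(target_var, empirical_var, bound, max_abs)
```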


### 4. Activation Functions

#### Sigmoid Function

- The sigmoid function maps any input in $\mathbb{R}$ to an output in the interval (0, 1):

$\operatorname{sigmoid}(x) = \frac{1}{1 + \exp(-x)}$
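
A quick sketch comparing the formula with the built-in `torch.sigmoid` on a few sample points (the input range is an arbitrary choice):

```python
import torch

x = torch.linspace(-8.0, 8.0, steps=9)

# Direct translation of the formula
manual = 1 / (1 + torch.exp(-x))
builtin = torch.sigmoid(x)

print(manual)
print(torch.allclose(manual, builtin))
```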


#### Hyperbolic tangent (Tanh) Function

- The tanh (hyperbolic tangent) function squashes its inputs into the interval between -1 and 1:

$\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}$
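
The formula can be checked against `torch.tanh`. The sketch also uses the identity $\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$, which follows from the two definitions; it is added here for illustration.

```python
import torch

x = torch.linspace(-4.0, 4.0, steps=9)

# Direct translation of the formula
manual = (1 - torch.exp(-2 * x)) / (1 + torch.exp(-2 * x))
# tanh expressed as a rescaled, shifted sigmoid
via_sigmoid = 2 * torch.sigmoid(2 * x) - 1
builtin = torch.tanh(x)

print(torch.allclose(manual, builtin, atol=1e-5))
print(torch.allclose(via_sigmoid, builtin, atol=1e-5))
```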


#### Rectified linear unit (ReLU) Function

- The ReLU function retains only positive elements and discards all negative elements by setting the corresponding activations to 0:

$\operatorname{ReLU}(x) = \max(x, 0)$

#### Leaky ReLU function


<img src="img/lrelu.png" width=200 height=200 />

- For $\alpha \in (0, 1)$, leaky ReLU is a nonlinear function that gives a non-zero output for a negative input: $\operatorname{LeakyReLU}(x) = \max(0, x) + \alpha \min(0, x)$
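
A minimal sketch contrasting ReLU and leaky ReLU on the same inputs, assuming $\alpha = 0.1$ (an arbitrary value chosen for illustration):

```python
import torch
from torch.nn import functional as F

alpha = 0.1  # assumed slope for negative inputs
x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

# ReLU zeroes negative inputs; leaky ReLU scales them by alpha instead
relu_out = F.relu(x)
manual_leaky = torch.clamp(x, min=0) + alpha * torch.clamp(x, max=0)
builtin_leaky = F.leaky_relu(x, negative_slope=alpha)

print(relu_out)
print(manual_leaky)
```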


#### GELU (Gaussian Error Linear Unit)

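$\operatorname{GELU}(x) = x \, \Phi(x)$

where $\Phi(x)$ is the cumulative distribution function of the standard Gaussian distribution. A commonly used tanh-based approximation is

$\operatorname{GELU}(x) \approx 0.5 \, x \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}} \left(x + 0.044715 \, x^{3}\right)\right)\right)$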
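
The exact form and the tanh expression can be compared numerically. In this sketch the exact form is written with the error function, using $\Phi(x) = \frac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$; the input range is an arbitrary choice.

```python
import math
import torch
from torch.nn import functional as F

x = torch.linspace(-5.0, 5.0, steps=1001)

# Exact GELU: x * Phi(x), with Phi written via the error function
exact = x * 0.5 * (1 + torch.erf(x / math.sqrt(2.0)))
# Tanh-based expression from the text
approx = 0.5 * x * (1 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

max_gap = (exact - approx).abs().max().item()
print(max_gap)
print(torch.allclose(exact, F.gelu(x), atol=1e-6))
```

The gap between the two curves is small but not zero, so the tanh expression is best read as an approximation.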

#### Swish function

- The Sigmoid Linear Unit (SiLU), also known as the swish function, is applied element-wise:

$\operatorname{silu}(x) = x \, \sigma(x)$

where $\sigma(x)$ is the logistic sigmoid.
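
A short sketch checking the definition against `F.silu` (the sample inputs are arbitrary):

```python
import torch
from torch.nn import functional as F

x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])

# x * sigmoid(x), computed element-wise
manual = x * torch.sigmoid(x)
builtin = F.silu(x)

print(manual)
print(torch.allclose(manual, builtin))
```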

### 5. Optimization

#### Adam

- Adam uses exponentially weighted moving averages (also known as leaky averaging) to estimate both the momentum and the second moment of the gradient:

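$v_{t} \leftarrow \beta_{1} v_{t-1} + (1 - \beta_{1}) g_{t},$

$s_{t} \leftarrow \beta_{2} s_{t-1} + (1 - \beta_{2}) g_{t}^{2}.$

Here $\beta_{1}$ and $\beta_{2}$ are nonnegative weighting parameters. Common choices for them are
$\beta_{1} = 0.9$ and $\beta_{2} = 0.999$.

Because $v_{0} = s_{0} = 0$, both estimates are biased toward zero in the early steps, so we use the bias-corrected values

$\hat{v}_{t} = \frac{v_{t}}{1 - \beta_{1}^{t}}, \quad \hat{s}_{t} = \frac{s_{t}}{1 - \beta_{2}^{t}}.$

Then we rescale the gradient

$g_{t}' = \frac{\lambda \hat{v}_{t}}{\sqrt{\hat{s}_{t}} + \epsilon},$

where $\lambda$ is the learning rate and $\epsilon$ is a small constant for numerical stability, and update the parameters:

$x_{t} \leftarrow x_{t-1} - g_{t}'.$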
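
The update rule can be sketched by hand and compared with `torch.optim.Adam` on a toy objective. The objective $f(x) = \sum_i x_i^2$, the learning rate `lr`, and the variable names are assumptions chosen for illustration.

```python
import torch

lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

# Manual state: parameters, momentum estimate v, second-moment estimate s
x_manual = torch.tensor([1.0, -2.0, 3.0])
v = torch.zeros_like(x_manual)
s = torch.zeros_like(x_manual)

# Reference: the built-in optimizer on the same starting point
x_ref = x_manual.clone().requires_grad_(True)
opt = torch.optim.Adam([x_ref], lr=lr, betas=(beta1, beta2), eps=eps)

for t in range(1, 6):
    g = 2 * x_manual                       # gradient of sum(x^2)
    v = beta1 * v + (1 - beta1) * g        # momentum estimate
    s = beta2 * s + (1 - beta2) * g ** 2   # second-moment estimate
    v_hat = v / (1 - beta1 ** t)           # bias correction
    s_hat = s / (1 - beta2 ** t)
    x_manual = x_manual - lr * v_hat / (torch.sqrt(s_hat) + eps)

    opt.zero_grad()
    (x_ref ** 2).sum().backward()
    opt.step()

print(x_manual)
print(x_ref.detach())
```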

- For gradients with significant variance we may encounter issues with convergence. These can be mitigated by using larger minibatches or by switching to an improved estimate for $s_{t}$. Yogi offers such an alternative.
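
A minimal sketch of how Yogi changes the second-moment estimate, assuming the commonly cited form $s_{t} = s_{t-1} + (1 - \beta_{2})\, g_{t}^{2} \odot \operatorname{sgn}(g_{t}^{2} - s_{t-1})$. The function names are hypothetical, and only this one step differs from Adam here.

```python
import torch

beta2 = 0.999

def adam_second_moment(s, g):
    # Adam: exponentially weighted average of squared gradients
    return beta2 * s + (1 - beta2) * g ** 2

def yogi_second_moment(s, g):
    # Yogi (assumed form): additive update whose size depends only on g^2,
    # with the direction given by the sign of (g^2 - s)
    g2 = g ** 2
    return s + (1 - beta2) * g2 * torch.sign(g2 - s)

s = torch.tensor([1.0, 1.0])
g = torch.tensor([0.1, 10.0])  # one small and one large gradient

adam_s = adam_second_moment(s, g)
yogi_s = yogi_second_moment(s, g)
print(adam_s)
print(yogi_s)
```

With a small gradient, the Yogi estimate shrinks by an amount proportional to $g_{t}^{2}$ rather than to $s_{t-1}$, so it decays more slowly than Adam's.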