# Positionwise FFN & AddNorm

The Transformer is an instance of the encoder-decoder architecture:

![jupyter](../images/10/transformer.svg)

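Multi-head attention and positional encoding were described earlier.

The following sections cover the remaining components of the Transformer model: the positionwise feed-forward network and the residual connection followed by layer normalization (add & norm).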

## Positionwise Feed-Forward Network

The positionwise FFN transforms the representation at every sequence position using the same MLP.

In [1]:
import torch
from torch import nn

In [2]:
#@save
class PositionWiseFFN(nn.Module):
    def __init__(self, ffn_num_input, ffn_num_hiddens):
        super(PositionWiseFFN, self).__init__()
        self.dense1 = nn.Linear(ffn_num_input, ffn_num_hiddens)
        self.relu = nn.ReLU()
        self.dense2 = nn.Linear(ffn_num_hiddens, ffn_num_input)

    def forward(self, X):
        """
        X shape: (batch_size, seq_len, ffn_num_input)
        """
        return self.dense2(self.relu(self.dense1(X)))

In [3]:
ffn = PositionWiseFFN(4, 8)
ffn.eval()
ffn(torch.ones((2, 3, 4)))

tensor([[[ 0.2890,  0.0860, -0.2614,  0.1723],
         [ 0.2890,  0.0860, -0.2614,  0.1723],
         [ 0.2890,  0.0860, -0.2614,  0.1723]],

        [[ 0.2890,  0.0860, -0.2614,  0.1723],
         [ 0.2890,  0.0860, -0.2614,  0.1723],
         [ 0.2890,  0.0860, -0.2614,  0.1723]]], grad_fn=<AddBackward0>)
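Every row in the output above is identical because the input is all ones and each position passes through the same weights. The sketch below checks the "same MLP at every position" property on random input. It uses a stand-in `nn.Sequential` with the same Linear, ReLU, Linear layout rather than the `PositionWiseFFN` class itself, and the seed and tensor sizes are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

# Stand-in for the positionwise FFN: Linear -> ReLU -> Linear on the last dimension.
mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))
mlp.eval()

X = torch.randn(2, 3, 4)  # (batch_size, seq_len, ffn_num_input)

with torch.no_grad():
    # Apply the MLP to the whole tensor at once.
    out_all = mlp(X)
    # Apply the same MLP to each sequence position separately, then restack
    # along the sequence dimension.
    out_per_position = torch.stack(
        [mlp(X[:, t, :]) for t in range(X.shape[1])], dim=1)

same = torch.allclose(out_all, out_per_position, atol=1e-6)
print(same)
```

Both paths should agree up to floating-point tolerance, since `nn.Linear` acts only on the last dimension and treats all leading dimensions as batch dimensions.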

## Add & Norm

Layer normalization: each sample is normalized across its features.

Batch normalization: each feature is normalized across the samples in the batch.

In [4]:
ln = nn.LayerNorm(3)
bn = nn.BatchNorm1d(3)
X = torch.tensor([[1, 2, 3], [4, 6, 8]], dtype=torch.float32)
print('layer norm:', ln(X), '\nbatch norm:', bn(X))

layer norm: tensor([[-1.2247,  0.0000,  1.2247],
        [-1.2247,  0.0000,  1.2247]], grad_fn=<NativeLayerNormBackward>) 
batch norm: tensor([[-1.0000, -1.0000, -1.0000],
        [ 1.0000,  1.0000,  1.0000]], grad_fn=<NativeBatchNormBackward>)
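To make the difference concrete, the following sketch recomputes both results by hand for the same `X`. It assumes the default `eps=1e-5`, freshly initialized layers (scale 1, shift 0), and batch normalization in training mode, so that batch statistics are used. The names `ln_manual` and `bn_manual` are illustrative only.

```python
import torch
from torch import nn

X = torch.tensor([[1, 2, 3], [4, 6, 8]], dtype=torch.float32)
eps = 1e-5  # default eps of nn.LayerNorm and nn.BatchNorm1d

# Layer norm: statistics over the feature dimension (one mean/variance per row).
ln_manual = (X - X.mean(dim=-1, keepdim=True)) / torch.sqrt(
    X.var(dim=-1, unbiased=False, keepdim=True) + eps)

# Batch norm (training mode): statistics over the batch dimension
# (one mean/variance per column).
bn_manual = (X - X.mean(dim=0, keepdim=True)) / torch.sqrt(
    X.var(dim=0, unbiased=False, keepdim=True) + eps)

# Compare against freshly initialized PyTorch layers.
ln_ok = torch.allclose(ln_manual, nn.LayerNorm(3)(X), atol=1e-5)
bn_ok = torch.allclose(bn_manual, nn.BatchNorm1d(3)(X), atol=1e-5)
print(ln_ok, bn_ok)
```

The only difference between the two manual formulas is the dimension over which the mean and variance are taken: `dim=-1` for layer normalization, `dim=0` for batch normalization.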


In [5]:
#@save
class AddNorm(nn.Module):
    """a residual connection followed by layer normalization."""
    def __init__(self, normalized_shape, dropout):
        super(AddNorm, self).__init__()
        self.dropout = nn.Dropout(dropout)
        self.ln = nn.LayerNorm(normalized_shape)

    def forward(self, X, Y):
        """
        X, Y shape: (N, *normalized_shape)
        The mean and standard deviation are computed over the trailing dimensions,
        whose sizes must match normalized_shape:
        ln(X) = (X - E[X]) / sqrt(Var[X] + eps) * gamma + beta
        gamma and beta are learnable parameters of shape normalized_shape.
        """
        return self.ln(self.dropout(Y) + X)

In [6]:
add_norm = AddNorm([3, 4], 0.5)
add_norm.eval()
add_norm(torch.ones((2, 3, 4)), torch.ones((2, 3, 4))).shape

torch.Size([2, 3, 4])
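With `normalized_shape=[3, 4]`, the statistics are computed jointly over the last two dimensions of each sample. The sketch below verifies this by hand and contrasts it with `nn.LayerNorm(4)`, which normalizes each sequence position separately. In eval mode dropout is the identity, so the residual sum is simply `X + Y`. The random inputs and variable names are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility
X = torch.randn(2, 3, 4)
Y = torch.randn(2, 3, 4)
S = X + Y  # residual sum; dropout is the identity in eval mode

eps = 1e-5  # default eps of nn.LayerNorm

# normalized_shape=[3, 4]: one mean/variance per sample over the last two dims.
mean = S.mean(dim=(-2, -1), keepdim=True)
var = S.var(dim=(-2, -1), unbiased=False, keepdim=True)
manual = (S - mean) / torch.sqrt(var + eps)

ln_full = nn.LayerNorm([3, 4])  # normalizes over (seq_len, features)
ln_last = nn.LayerNorm(4)       # normalizes over features only

matches_full = torch.allclose(manual, ln_full(S), atol=1e-5)

# Zero mean per sample for ln_full, zero mean per position for ln_last.
per_sample_mean = ln_full(S).mean(dim=(-2, -1))  # shape (2,)
per_position_mean = ln_last(S).mean(dim=-1)      # shape (2, 3)
print(matches_full, per_sample_mean.shape, per_position_mean.shape)
```

Which of the two shapes to pass depends on whether the statistics should be shared across sequence positions. Normalizing over the feature dimension alone keeps the result independent of sequence length.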