# Building the Transformer from Scratch

In this notebook, we'll be implementing the famous Transformer architecture from scratch.

The code is based on the following repos/blog posts:

- [attention-is-all-you-need-pytorch](https://github.com/jadore801120/attention-is-all-you-need-pytorch)
- [pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT)
- [The Annotated Transformer](http://nlp.seas.harvard.edu/2018/04/03/attention.html) 

Thanks so much to their authors!

In [1]:
import torch
import torch.nn as nn
import numpy as np

One of the keys to understanding how any model works is understanding how the shapes of the tensors change during the processing of each part. We'll be using the logging module to output debugging information to help our understanding.

In [2]:
import logging
logger = logging.getLogger("tensor_shapes")
handler = logging.StreamHandler()
formatter = logging.Formatter(
        '%(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
# Custom levels 1-5 are defined below (see TensorLoggingLevels); level 1 prints every tensor shape.
# Raise the level later to silence the lower-level modules.
logger.setLevel(1)

In [3]:
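import inspect
def getclass():
    # Walk up the call stack: frame 0 is getclass, frame 1 is log_size,
    # and frame 2 is the method (e.g. forward) that called log_size.
    stack = inspect.stack()
    return stack[2][0].f_locals["self"].__class__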

# A helper function to check how tensor sizes change
def log_size(tsr: torch.Tensor, name: str):
    cls = getclass()
    logger.log(level=cls.level, msg=f"[{cls.__name__}] {name} size={tsr.shape}")
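
The helper above only reports shapes where a module calls `log_size` explicitly. As a hedged alternative, the sketch below uses PyTorch forward hooks to record output shapes without touching the module code. The names `attach_shape_hooks` and `records` are illustrative and are not used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

def attach_shape_hooks(model: nn.Module, records: list):
    """Register a forward hook on every submodule that records (name, output shape)."""
    handles = []
    for name, module in model.named_modules():
        if name == "":  # skip the root module itself
            continue
        # bind `name` as a default argument so each hook keeps its own name
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                records.append((name, tuple(output.shape)))
        handles.append(module.register_forward_hook(hook))
    return handles

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 20))
records = []
handles = attach_shape_hooks(model, records)
model(torch.rand(5, 10, 20))
for name, shape in records:
    print(f"[{name}] output size={shape}")
for h in handles:
    h.remove()  # detach the hooks once we are done
```

Hooks fire in execution order, so the records read top to bottom like the log output produced by `log_size`.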

We'll use logging levels to control the modules we receive output from. The lower the logging level, the more tensor information you'll get. Feel free to play around!

In [4]:
from enum import IntEnum
# Control how much debugging output we want
class TensorLoggingLevels(IntEnum):
    attention = 1
    attention_head = 2
    multihead_attention_block = 3
    enc_dec_block = 4
    enc_dec = 5

We'll be using an enum to refer to dimensions whenever possible to improve readability.

In [5]:
class Dim(IntEnum):
    batch = 0
    seq = 1
    feature = 2
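
As a quick illustration of why the enum helps, the sketch below repeats the `Dim` definition so it runs on its own and uses it in the same kinds of calls that appear later in the notebook (`transpose` and `torch.cat`). The tensor sizes are arbitrary.

```python
import torch
from enum import IntEnum

class Dim(IntEnum):
    batch = 0
    seq = 1
    feature = 2

x = torch.rand(5, 10, 20)                        # (Batch, Seq, Feature)
seq_len = x.shape[Dim.seq]                       # reads better than x.shape[1]
xt = x.transpose(Dim.seq, Dim.feature)           # (Batch, Feature, Seq)
doubled = torch.cat([x, x], dim=Dim.feature)     # (Batch, Seq, 2 * Feature)
print(seq_len, xt.shape, doubled.shape)
```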

# Components

### Scaled dot product attention

The Transformer is an attention-based architecture. The attention used in the Transformer is the scaled dot product attention, represented by the following formula.

$$ \textrm{Attention}(Q, K, V) = \textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$

![image](https://i2.wp.com/mlexplained.com/wp-content/uploads/2017/12/scaled_dot_product_attention.png?zoom=2&w=750)
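
Before wrapping the formula in a module, it can help to check it on small tensors. The sketch below is a minimal example under stated assumptions: the padding mask is hypothetical (it treats the last key position as padding, with `True` meaning "ignore"), and the tensor sizes are arbitrary. It compares two ways of applying the mask: zeroing the exponentiated scores and renormalizing, versus filling the scores with `-inf` before a softmax.

```python
import math
import torch

torch.manual_seed(0)
q = torch.rand(2, 4, 8)   # (Batch, Seq, Feature)
k = torch.rand(2, 4, 8)
v = torch.rand(2, 4, 8)

# Hypothetical padding mask: True marks key positions to ignore.
# Shape (Batch, 1, Seq) broadcasts over the query positions.
mask = torch.zeros(2, 1, 4, dtype=torch.bool)
mask[:, :, -1] = True

scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(k.size(-1))  # (Batch, Seq, Seq)

# Route 1: exponentiate, zero the masked weights, renormalize
w1 = torch.exp(scores).masked_fill(mask, 0)
w1 = w1 / w1.sum(dim=-1, keepdim=True)

# Route 2: fill masked scores with -inf, then apply softmax
w2 = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

out = torch.bmm(w2, v)  # (Batch, Seq, Feature)
print(torch.allclose(w1, w2, atol=1e-6), out.shape)
```

Both routes give the same weights; each row sums to 1 and the masked column receives zero weight.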

In [6]:
import math

class ScaledDotProductAttention(nn.Module):
    level = TensorLoggingLevels.attention # logging level used by log_size
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, q, k, v, mask=None):
        d_k = k.size(-1) # get the size of the key
        assert q.size(-1) == d_k

        # compute the dot product between queries and keys for
        # each batch and position in the sequence
        attn = torch.bmm(q, k.transpose(Dim.seq, Dim.feature)) # (Batch, Seq, Seq)
        # we get an attention score between each position in the sequence
        # for each batch

        # scale the dot products by sqrt(d_k) (see the paper for why we do this!)
        attn = attn / math.sqrt(d_k)
        # softmax over the key positions (the last dimension), computed by hand so
        # the mask can be applied between exponentiation and normalization
        attn = torch.exp(attn)
        log_size(attn, "attention weight") # (Batch, Seq, Seq)
        
        # zero out the attention weights at masked (e.g. padded) positions
        if mask is not None: attn = attn.masked_fill(mask, 0)
        attn = attn / attn.sum(dim=-1, keepdim=True)
        attn = self.dropout(attn)
        output = torch.bmm(attn, v) # (Batch, Seq, Feature)
        log_size(output, "attention output size") # (Batch, Seq, Feature)
        return output

In [7]:
attn = ScaledDotProductAttention()

In [8]:
q = torch.rand(5, 10, 20)
k = torch.rand(5, 10, 20)
v = torch.rand(5, 10, 20)

In [9]:
attn(q, k, v)

[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])


tensor([[[0.4483, 0.6249, 0.4408, 0.5156, 0.4743, 0.5172, 0.4808, 0.6804,
          0.4991, 0.4802, 0.6905, 0.4754, 0.4845, 0.4595, 0.4427, 0.4852,
          0.6439, 0.3720, 0.4924, 0.4870],
         [0.4369, 0.5293, 0.4167, 0.4561, 0.4426, 0.4175, 0.4314, 0.6212,
          0.4854, 0.3984, 0.5607, 0.3664, 0.4413, 0.3703, 0.3464, 0.4736,
          0.5042, 0.3894, 0.3976, 0.4720],
         [0.4000, 0.6714, 0.4945, 0.5342, 0.3825, 0.4558, 0.4701, 0.7120,
          0.5406, 0.4414, 0.6396, 0.5229, 0.4415, 0.5028, 0.4552, 0.4466,
          0.5253, 0.4450, 0.4930, 0.4459],
         [0.4549, 0.6190, 0.4529, 0.5140, 0.4734, 0.5101, 0.4743, 0.6831,
          0.4990, 0.4746, 0.6796, 0.4653, 0.4834, 0.4589, 0.4280, 0.4894,
          0.6144, 0.3771, 0.4860, 0.4962],
         [0.3270, 0.4722, 0.4131, 0.3552, 0.2582, 0.2866, 0.3154, 0.5365,
          0.5085, 0.2973, 0.4904, 0.3656, 0.3859, 0.3883, 0.3579, 0.3699,
          0.4034, 0.3194, 0.3842, 0.3813],
         [0.5001, 0.7027, 0.5021, 0.5506, 0.4

### Multi-Head Attention

Now, we turn to the core component in the Transformer architecture: the multi-head attention block. This block applies linear transformations to the input, then applies scaled dot product attention.

![image](https://i2.wp.com/mlexplained.com/wp-content/uploads/2017/12/multi_head_attention.png?zoom=2&resize=224%2C293)

In [10]:
class AttentionHead(nn.Module):
    level = TensorLoggingLevels.attention_head
    def __init__(self, d_model, d_feature, dropout=0.1):
        super().__init__()
        # We will assume the queries, keys, and values all have the same feature size
        self.attn = ScaledDotProductAttention(dropout)
        self.query_tfm = nn.Linear(d_model, d_feature)
        self.key_tfm = nn.Linear(d_model, d_feature)
        self.value_tfm = nn.Linear(d_model, d_feature)

    def forward(self, queries, keys, values, mask=None):
        Q = self.query_tfm(queries) # (Batch, Seq, Feature)
        K = self.key_tfm(keys) # (Batch, Seq, Feature)
        V = self.value_tfm(values) # (Batch, Seq, Feature)
        log_size(Q, "queries, keys, vals")
        # apply scaled dot product attention to the projected queries, keys, and values
        x = self.attn(Q, K, V, mask)
        return x

In [11]:
attn_head = AttentionHead(20, 20)
attn_head(q, k, v)

[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[ScaledDotProductAttention] attention weight size=torch.Size([5, 10, 10])
[ScaledDotProductAttention] attention output size size=torch.Size([5, 10, 20])


tensor([[[ 0.0238,  0.1517, -0.1136,  0.3600,  0.5793,  0.4810,  0.5199,
           0.2519,  0.4256,  0.4177, -0.2117, -0.6132,  0.0139, -0.1483,
           0.0267,  0.1633, -0.0067,  0.8864, -0.3889, -0.5447],
         [ 0.0454,  0.1453, -0.0759,  0.3506,  0.5473,  0.4195,  0.4682,
           0.2472,  0.4111,  0.3637, -0.2134, -0.5537,  0.0039, -0.1573,
           0.0459,  0.1266,  0.0056,  0.8011, -0.3324, -0.4825],
         [ 0.0131,  0.1344, -0.0980,  0.3264,  0.5227,  0.4492,  0.4796,
           0.2134,  0.3958,  0.3693, -0.1795, -0.5554,  0.0123, -0.1559,
           0.0100,  0.1447, -0.0218,  0.7680, -0.3628, -0.4881],
         [-0.0106,  0.1171, -0.1473,  0.3370,  0.5457,  0.5062,  0.4814,
           0.2054,  0.3945,  0.3926, -0.1718, -0.5806,  0.0201, -0.1544,
           0.0344,  0.1406, -0.0233,  0.8585, -0.4306, -0.5781],
         [-0.0053,  0.1495, -0.1153,  0.3350,  0.5565,  0.5025,  0.5206,
           0.1967,  0.4121,  0.4178, -0.1702, -0.6003,  0.0268, -0.1763,
          

The multi-head attention block simply applies multiple attention heads, then concatenates the outputs and applies a single linear projection.
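
In practice the heads are usually computed together rather than in a Python loop. The sketch below shows one common way to do that, assuming `d_model` is divisible by `n_heads`. The class name `FusedMultiHeadAttention` and its layout are illustrative, not taken from the repos above, and dropout and masking are left out to keep it short.

```python
import math
import torch
import torch.nn as nn

class FusedMultiHeadAttention(nn.Module):
    """Illustrative batched variant: one big projection, then split into heads."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_feature = d_model // n_heads
        self.query_tfm = nn.Linear(d_model, d_model)
        self.key_tfm = nn.Linear(d_model, d_model)
        self.value_tfm = nn.Linear(d_model, d_model)
        self.projection = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        # (Batch, Seq, d_model) -> (Batch, Heads, Seq, d_feature)
        b, s, _ = x.shape
        return x.view(b, s, self.n_heads, self.d_feature).transpose(1, 2)

    def forward(self, queries, keys, values):
        Q = self.split_heads(self.query_tfm(queries))
        K = self.split_heads(self.key_tfm(keys))
        V = self.split_heads(self.value_tfm(values))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_feature)  # (Batch, Heads, Seq_q, Seq_k)
        x = torch.softmax(scores, dim=-1) @ V                         # (Batch, Heads, Seq_q, d_feature)
        b, _, s, _ = x.shape
        # merge the heads back: equivalent to concatenating per-head outputs
        x = x.transpose(1, 2).reshape(b, s, self.n_heads * self.d_feature)
        return self.projection(x)                                     # (Batch, Seq_q, d_model)

fused = FusedMultiHeadAttention(d_model=160, n_heads=8)
out = fused(torch.rand(5, 10, 160), torch.rand(5, 10, 160), torch.rand(5, 10, 160))
print(out.shape)
```

The output has the same shape as the input queries, matching the per-head implementation in this notebook.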

In [12]:
# We'll suppress logging from the scaled dot product attention now
logger.setLevel(TensorLoggingLevels.attention_head)

In [13]:
class MultiHeadAttention(nn.Module):
    level = TensorLoggingLevels.multihead_attention_block
    def __init__(self, d_model, d_feature, n_heads, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.d_feature = d_feature
        self.n_heads = n_heads
        # in practice, d_model == d_feature * n_heads
        assert d_model == d_feature * n_heads

        # Note that this is very inefficient:
        # I am merely implementing the heads separately because it is 
        # easier to understand this way
        self.attn_heads = nn.ModuleList([
            AttentionHead(d_model, d_feature, dropout) for _ in range(n_heads)
        ])
        self.projection = nn.Linear(d_feature * n_heads, d_model) 
    
    def forward(self, queries, keys, values, mask=None):
        log_size(queries, "Input queries")
        x = [attn(queries, keys, values, mask=mask) # (Batch, Seq, Feature)
             for attn in self.attn_heads]
        log_size(x[0], "output of single head")
        
        # concatenate the head outputs along the feature dimension
        x = torch.cat(x, dim=Dim.feature) # (Batch, Seq, D_Feature * n_heads)
        log_size(x, "concatenated output")
        x = self.projection(x) # (Batch, Seq, D_Model)
        log_size(x, "projected output")
        return x

In [14]:
heads = MultiHeadAttention(20 * 8, 20, 8)
heads(q.repeat(1, 1, 8), 
      k.repeat(1, 1, 8), 
      v.repeat(1, 1, 8))

[MultiHeadAttention] Input queries size=torch.Size([5, 10, 160])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[AttentionHead] queries, keys, vals size=torch.Size([5, 10, 20])
[MultiHeadAttention] output of single head size=torch.Size([5, 10, 20])
[MultiHeadAttention] concatenated output size=torch.Size([5, 10, 160])
[MultiHeadAttention] projected output size=torch.Size([5, 10, 160])


tensor([[[-5.5434e-02,  7.3836e-02,  2.1540e-01,  ...,  9.5668e-02,
           2.5618e-01,  3.3740e-02],
         [-5.1989e-02,  9.7644e-02,  1.7390e-01,  ...,  1.2145e-01,
           2.1276e-01,  5.6385e-02],
         [-5.7726e-02,  7.8944e-02,  2.1749e-01,  ...,  8.0804e-02,
           2.5118e-01,  1.2463e-02],
         ...,
         [-6.4274e-02,  5.6929e-02,  1.9471e-01,  ...,  1.0194e-01,
           2.4769e-01,  4.6319e-02],
         [-3.4287e-02,  8.8949e-02,  2.4115e-01,  ...,  1.0646e-01,
           1.9937e-01,  3.4541e-02],
         [-5.6988e-02,  9.4813e-02,  2.2184e-01,  ...,  1.0748e-01,
           2.5925e-01,  6.2197e-03]],

        [[-1.7362e-03,  1.3871e-01,  2.2642e-01,  ...,  1.0594e-01,
           2.1807e-01,  3.7778e-02],
         [-1.9615e-02,  1.2238e-01,  2.5283e-01,  ...,  1.1735e-01,
           1.8682e-01,  4.4808e-02],
         [-1.8819e-02,  1.2427e-01,  2.6472e-01,  ...,  1.2316e-01,
           2.0093e-01,  5.4456e-02],
         ...,
         [-4.7275e-02,  1

### The Encoder

With these core components in place, implementing the encoder is pretty easy.

![image](https://i2.wp.com/mlexplained.com/wp-content/uploads/2017/12/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2017-12-29-19.14.41.png?w=273)

The encoder consists of the following components:
- A multi-head attention block
- A simple feedforward neural network

These components are connected using residual connections and layer normalization.

In [15]:
# We'll suppress logging from the individual attention heads
logger.setLevel(TensorLoggingLevels.multihead_attention_block)

Layer normalization is similar to batch normalization, but normalizes across the feature dimension instead of the batch dimension.

![image](https://i1.wp.com/mlexplained.com/wp-content/uploads/2018/01/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-01-11-11.48.12.png?w=1500)
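
The difference between the two normalizations is easiest to see by looking at which dimensions the statistics are computed over. The sketch below is a hand-rolled comparison on an arbitrary random tensor. The batch-norm-style branch pools over batch and sequence positions, which is an assumption about how batch normalization would be applied to sequence data, and neither branch includes learned scale or shift parameters.

```python
import torch

torch.manual_seed(0)
x = torch.rand(5, 10, 512) * 3 + 2   # (Batch, Seq, Feature), deliberately not zero-mean

# Layer normalization: one mean/std per (batch, position), computed over features
ln_mean = x.mean(dim=-1, keepdim=True)        # (5, 10, 1)
ln_std = x.std(dim=-1, keepdim=True)
x_ln = (x - ln_mean) / (ln_std + 1e-8)

# Batch-norm-style: one mean/std per feature, computed over batch and positions
bn_mean = x.mean(dim=(0, 1), keepdim=True)    # (1, 1, 512)
bn_std = x.std(dim=(0, 1), keepdim=True)
x_bn = (x - bn_mean) / (bn_std + 1e-8)

print(ln_mean.shape, bn_mean.shape)
print(x_ln.mean(dim=-1).abs().max().item())      # close to 0 for every position
print(x_bn.mean(dim=(0, 1)).abs().max().item())  # close to 0 for every feature
```

Layer normalization does not depend on the other examples in the batch, which is convenient when batch sizes and sequence lengths vary.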

In [16]:
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-8):
        super(LayerNorm, self).__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta

An encoder block just wires these pieces together.

In [17]:
class EncoderBlock(nn.Module):
    level = TensorLoggingLevels.enc_dec_block
    def __init__(self, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        self.attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        self.layer_norm1 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.position_wise_feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.layer_norm2 = LayerNorm(d_model)
        
    def forward(self, x, mask=None):
        log_size(x, "Encoder block input")
        att = self.attn_head(x, x, x, mask=mask)
        log_size(att, "Attention output")
        # Normalize the sub-layer output, apply dropout, and add the residual connection
        # (note: the paper instead computes LayerNorm(x + Dropout(Sublayer(x))))
        x = x + self.dropout(self.layer_norm1(att))
        # Apply position-wise feedforward network
        pos = self.position_wise_feed_forward(x)
        log_size(pos, "Feedforward output")
        # Normalize, apply dropout, and add the residual connection
        x = x + self.dropout(self.layer_norm2(pos))
        log_size(x, "Encoder size output")
        return x

In [18]:
enc = EncoderBlock()

In [19]:
enc(torch.rand(5, 10, 512))

[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] output of single head size=torch.Size([5, 10, 64])
[MultiHeadAttention] concatenated output size=torch.Size([5, 10, 512])
[MultiHeadAttention] projected output size=torch.Size([5, 10, 512])
[EncoderBlock] Attention output size=torch.Size([5, 10, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 10, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 10, 512])


tensor([[[-0.9502, -0.0386,  2.0490,  ...,  4.3722, -0.8155, -0.6273],
         [-1.1045, -0.5730,  1.6829,  ...,  3.7669, -0.6746,  0.1586],
         [-0.5030, -0.8161,  1.9470,  ...,  3.7542, -1.6728, -0.6087],
         ...,
         [-1.6413, -0.8944,  1.5399,  ...,  5.0267, -0.0814,  0.7338],
         [-1.1089, -0.3992,  1.6307,  ...,  3.5277, -0.9177,  0.6617],
         [-0.7696, -0.6927,  0.7905,  ...,  3.7786, -1.1211, -0.0751]],

        [[-0.3022,  0.1411,  2.5665,  ...,  4.0616, -0.1462,  0.6828],
         [-0.8449,  0.0744,  1.3892,  ...,  4.0403, -0.6635,  0.3824],
         [-0.8907,  0.0267,  2.7662,  ...,  3.9748, -0.4940,  0.9116],
         ...,
         [-0.1639,  0.4431,  0.7396,  ...,  3.4216, -1.1346,  0.2292],
         [-1.2960,  0.0809,  2.0951,  ...,  4.3827, -1.1978,  0.9659],
         [-0.8447, -0.1919,  2.9091,  ...,  4.0368, -1.3761,  1.0358]],

        [[-2.2698, -0.7820,  1.9180,  ...,  3.3510, -0.1515,  0.4817],
         [-1.7525, -0.6157,  1.3984,  ...,  4

The full encoder consists of 6 consecutive encoder blocks, so it can be implemented as follows.

In [20]:
class TransformerEncoder(nn.Module):
    level = TensorLoggingLevels.enc_dec
    def __init__(self, n_blocks=6, d_model=512,
                 n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.encoders = nn.ModuleList([
            EncoderBlock(d_model=d_model, d_feature=d_model // n_heads,
                         d_ff=d_ff, n_heads=n_heads, dropout=dropout)
            for _ in range(n_blocks)
        ])
    
    def forward(self, x: torch.FloatTensor, mask=None):
        for encoder in self.encoders:
            x = encoder(x, mask=mask)
        return x

### The Decoder

The decoder is mostly the same as the encoder. There's just one additional multi-head attention block that takes the target sentence as input.

![image](https://i1.wp.com/mlexplained.com/wp-content/uploads/2017/12/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2017-12-29-19.14.47.png?w=287)

In this second attention block, the keys and values are the outputs of the encoder, and the queries are the outputs of the (masked) multi-head attention over the target sentence embeddings.
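
One consequence of this arrangement is that the source and target sequences may have different lengths. The sketch below traces the shapes through a bare attention step, leaving out the learned projections; the lengths 7 and 30 and the feature size 64 are arbitrary choices for illustration.

```python
import math
import torch

tgt = torch.rand(5, 7, 64)       # queries: 7 target positions per example
enc_out = torch.rand(5, 30, 64)  # keys and values: 30 source positions per example

scores = torch.bmm(tgt, enc_out.transpose(1, 2)) / math.sqrt(64)  # (5, 7, 30)
weights = torch.softmax(scores, dim=-1)   # each target position distributes weight over the source
out = torch.bmm(weights, enc_out)         # (5, 7, 64): one vector per target position

print(scores.shape, out.shape)
```

The output always has the sequence length of the queries, so the decoder's output length follows the target sentence, not the source.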

In [21]:
class DecoderBlock(nn.Module):
    level = TensorLoggingLevels.enc_dec_block
    def __init__(self, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        self.masked_attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        self.attn_head = MultiHeadAttention(d_model, d_feature, n_heads, dropout)
        self.position_wise_feed_forward = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

        self.layer_norm1 = LayerNorm(d_model)
        self.layer_norm2 = LayerNorm(d_model)
        self.layer_norm3 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_out, 
                src_mask=None, tgt_mask=None):
        # Apply masked self-attention to the target inputs
        att = self.masked_attn_head(x, x, x, mask=tgt_mask)
        x = x + self.dropout(self.layer_norm1(att))
        # Attend over the encoder outputs, using the previous sub-layer's output as queries
        att = self.attn_head(queries=x, keys=enc_out, values=enc_out, mask=src_mask)
        x = x + self.dropout(self.layer_norm2(att))
        # Apply position-wise feedforward network
        pos = self.position_wise_feed_forward(x)
        x = x + self.dropout(self.layer_norm3(pos))
        return x
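
The decoder block accepts masks, but this notebook never constructs one. Below is a minimal sketch of a causal mask for the decoder's masked self-attention, assuming the same convention as `ScaledDotProductAttention` (a boolean mask where `True` marks weights to zero out). The helper name `subsequent_mask` is borrowed from common usage and is not defined elsewhere in this notebook.

```python
import torch

def subsequent_mask(seq_len: int) -> torch.Tensor:
    """Boolean mask of shape (1, Seq, Seq); True above the diagonal,
    so position i cannot attend to any later position j > i."""
    upper = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
    return upper.bool().unsqueeze(0)  # leading dim broadcasts over the batch

mask = subsequent_mask(4)
print(mask[0].int())

# Apply it the same way the attention module does: exponentiate, zero, renormalize
scores = torch.rand(2, 4, 4)                      # (Batch, Seq, Seq)
weights = torch.exp(scores).masked_fill(mask, 0)
weights = weights / weights.sum(dim=-1, keepdim=True)
print(weights[0])
```

The first row of `weights` puts all of its mass on position 0, and each later row can also see one more position than the row before it.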

In [22]:
dec = DecoderBlock()
dec(torch.rand(5, 10, 512), enc(torch.rand(5, 10, 512)))

[EncoderBlock] Encoder block input size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] output of single head size=torch.Size([5, 10, 64])
[MultiHeadAttention] concatenated output size=torch.Size([5, 10, 512])
[MultiHeadAttention] projected output size=torch.Size([5, 10, 512])
[EncoderBlock] Attention output size=torch.Size([5, 10, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 10, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] output of single head size=torch.Size([5, 10, 64])
[MultiHeadAttention] concatenated output size=torch.Size([5, 10, 512])
[MultiHeadAttention] projected output size=torch.Size([5, 10, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 10, 512])
[MultiHeadAttention] output of single head size=torch.Size([5, 10, 64])
[MultiHeadAttention] concatenated output size=torch.Siz

tensor([[[-3.2290,  0.3235,  0.1065,  ..., -2.4907, -1.7681, -0.7010],
         [-1.5472, -2.2766,  0.2698,  ..., -3.2174, -0.7116, -1.0080],
         [-2.5818, -1.7114, -0.1525,  ..., -3.4856, -1.1320, -0.2742],
         ...,
         [-1.9990, -0.7424, -0.1879,  ..., -4.0495, -2.0523, -0.8105],
         [-2.6684, -0.6335,  0.1777,  ..., -2.4918, -1.2974, -0.7625],
         [-3.1428, -3.6049, -0.0199,  ..., -3.4796, -0.6485, -0.1586]],

        [[-2.9487, -0.8596,  1.5827,  ..., -2.5604, -0.1855, -1.1804],
         [-3.5555, -1.3430,  0.7323,  ..., -1.9973,  1.3033,  0.3560],
         [-3.7100, -1.6735,  0.5658,  ..., -2.1933,  0.0994, -0.2620],
         ...,
         [-4.5006, -1.8862,  0.6199,  ..., -3.7571,  0.4341, -0.7318],
         [-4.1769, -1.6887,  0.3228,  ..., -1.7997,  0.9390, -0.0960],
         [-3.2155, -1.7711, -0.0478,  ..., -4.2723, -0.5255, -0.0749]],

        [[-3.0206, -1.1389,  0.3656,  ..., -2.1171,  1.4828, -1.0802],
         [-3.6504, -2.5347,  1.1674,  ..., -2

Again, the decoder is just a stack of the underlying block, so it is simple to implement.

In [23]:
class TransformerDecoder(nn.Module):
    level = TensorLoggingLevels.enc_dec
    def __init__(self, n_blocks=6, d_model=512, d_feature=64,
                 d_ff=2048, n_heads=8, dropout=0.1):
        super().__init__()
        # NOTE: not used in forward(); inputs are expected to be embedded already
        self.position_embedding = PositionalEmbedding(d_model)
        self.decoders = nn.ModuleList([
            DecoderBlock(d_model=d_model, d_feature=d_model // n_heads,
                         d_ff=d_ff, n_heads=n_heads, dropout=dropout)
            for _ in range(n_blocks)
        ])
        
    def forward(self, x: torch.FloatTensor, 
                enc_out: torch.FloatTensor, 
                src_mask=None, tgt_mask=None):
        for decoder in self.decoders:
            x = decoder(x, enc_out, src_mask=src_mask, tgt_mask=tgt_mask)
        return x

### Positional Embeddings

Attention blocks are just simple matrix multiplications over a set of positions, so on their own they don't have any notion of order! The Transformer explicitly adds positional information via the positional embeddings.
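
The lack of order can be checked directly. The sketch below uses a bare self-attention function (no learned projections, an assumption made to keep the example small) and shows that shuffling the input positions only shuffles the outputs in the same way.

```python
import math
import torch

torch.manual_seed(0)

def self_attention(x):
    # plain scaled dot product self-attention, without projections
    scores = torch.bmm(x, x.transpose(1, 2)) / math.sqrt(x.size(-1))
    return torch.bmm(torch.softmax(scores, dim=-1), x)

x = torch.rand(1, 6, 16)     # (Batch, Seq, Feature)
perm = torch.randperm(6)     # a random reordering of the 6 positions

out = self_attention(x)
out_perm = self_attention(x[:, perm])

# Permuting the inputs permutes the outputs identically: no position is special
same = torch.allclose(out[:, perm], out_perm, atol=1e-5)
print(same)
```

Without extra positional information, the model would therefore treat a sentence as a bag of tokens.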

In [24]:
class PositionalEmbedding(nn.Module):
    level = 1
    def __init__(self, d_model, max_len=512):
        super().__init__()        
        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.weight = nn.Parameter(pe, requires_grad=False)
        
    def forward(self, x):
        return self.weight[:, :x.size(1), :] # (1, Seq, Feature)
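
The log-space computation above is a rewrite of the sinusoidal formula from the paper, $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The sketch below rebuilds the table with small, arbitrary sizes and compares one entry against the direct formula; the chosen `pos` and `i` are just sample values.

```python
import math
import torch

d_model, max_len = 16, 50
position = torch.arange(0, max_len).unsqueeze(1).float()            # (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                     -(math.log(10000.0) / d_model))                 # (d_model / 2,)
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)   # even feature indices
pe[:, 1::2] = torch.cos(position * div_term)   # odd feature indices

# Direct evaluation of the formula for one position and one frequency index
pos, i = 7, 3
angle = pos / (10000 ** (2 * i / d_model))
print(pe[pos, 2 * i].item(), math.sin(angle))
print(pe[pos, 2 * i + 1].item(), math.cos(angle))
```

Each printed pair should agree up to floating point error, since `exp(-2i * log(10000) / d_model)` equals `1 / 10000^(2i / d_model)`.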

In [25]:
class WordPositionEmbedding(nn.Module):
    level = 1
    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.word_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = PositionalEmbedding(d_model)
        
    def forward(self, x: torch.LongTensor, mask=None) -> torch.FloatTensor:
        return self.word_embedding(x) + self.position_embedding(x)

In [26]:
emb = WordPositionEmbedding(1000)
encoder = TransformerEncoder()

In [27]:
encoder(emb(torch.randint(1000, (5, 30))))

[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 30, 512])
[MultiHeadAttention] output of single head size=torch.Size([5, 30, 64])
[MultiHeadAttention] concatenated output size=torch.Size([5, 30, 512])
[MultiHeadAttention] projected output size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[MultiHeadAttention] Input queries size=torch.Size([5, 30, 512])
[MultiHeadAttention] output of single head size=torch.Size([5, 30, 64])
[MultiHeadAttention] concatenated output size=torch.Size([5, 30, 512])
[MultiHeadAttention] projected output size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[

tensor([[[  0.3980,  -4.3829,  -7.4285,  ...,   3.1774,   1.7740,   2.5976],
         [  0.0647,  -4.1780,  -7.2751,  ...,   3.9347,  -0.8255,   5.2059],
         [  0.3450,  -5.0790,  -6.8530,  ...,   1.7033,   2.2193,   0.8902],
         ...,
         [  4.6940,  -4.9993, -11.5730,  ...,   1.8663,   1.8704,   1.3255],
         [  0.1014,  -5.3799,  -7.3519,  ...,   1.3983,   3.2914,   3.3760],
         [  0.2468,  -3.6646,  -9.4579,  ...,   5.5852,   1.0168,   1.5736]],

        [[  5.9047,  -4.4927,  -8.4207,  ...,   2.1502,   3.5834,   4.6961],
         [  4.1149,   0.2121,  -3.7838,  ...,   3.4321,  -0.5452,   2.7152],
         [  3.0612,  -4.8328,  -4.9506,  ...,   1.6598,   5.2516,   2.9424],
         ...,
         [  4.9755,  -3.8555,  -7.2134,  ...,   4.7120,   5.0917,   3.8296],
         [  9.1925,  -4.7523,  -4.5528,  ...,   2.6217,   3.4018,  -1.0523],
         [  5.9289,  -4.7409,  -7.0705,  ...,   4.5752,   2.3157,   1.0976]],

        [[  5.5719,  -3.3389,  -8.9670,  ...

### Putting it All Together

Let's put everything together now.

![image](https://camo.githubusercontent.com/88e8f36ce61dedfd2491885b8df2f68c4d1f92f5/687474703a2f2f696d6775722e636f6d2f316b72463252362e706e67)

In [28]:
# We'll suppress logging from the multi-head attention blocks now
logger.setLevel(TensorLoggingLevels.enc_dec_block)

In [29]:
emb = WordPositionEmbedding(1000)
encoder = TransformerEncoder()
decoder = TransformerDecoder()

In [30]:
src_ids = torch.randint(1000, (5, 30))
tgt_ids = torch.randint(1000, (5, 30))
x = encoder(emb(src_ids))
decoder(emb(tgt_ids), x)

[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder block input size=torch.Size([5, 30, 512])
[EncoderBlock] Attention output size=torch.Size([5, 30, 512])
[EncoderBlock] Feedforward output size=torch.Size([5, 30, 512])
[EncoderBlock] Encoder size output size=t

tensor([[[-3.9986,  5.7489,  6.7348,  ..., -6.9215,  7.3174,  3.6647],
         [ 1.9130, -1.8961,  3.4541,  ..., -6.9436,  4.5712,  3.4783],
         [-3.4003,  2.4626,  4.9213,  ..., -5.5661,  3.2948,  4.1336],
         ...,
         [-1.5915, -0.5826,  7.6223,  ...,  0.1395,  5.7651,  3.3921],
         [-3.5197, -3.5261,  5.5072,  ..., -6.2547,  1.4598,  3.8519],
         [-2.9156, -0.7501,  3.9160,  ..., -5.9510,  2.6685,  1.9906]],

        [[-0.7690, -1.0143,  1.3752,  ..., -6.4549,  2.9288,  4.6994],
         [-3.7906, -3.1318,  1.3130,  ..., -8.5190,  2.9146,  3.7494],
         [-0.5046, -2.0664,  5.1320,  ..., -7.0491,  2.5344,  4.0302],
         ...,
         [-1.2325, -1.2299,  1.2333,  ..., -6.6735,  1.6056,  4.2359],
         [-2.7019, -2.6522, -1.4139,  ..., -4.7800,  1.5574,  2.3967],
         [ 0.3484, -4.1592,  2.2151,  ..., -6.2922,  1.4931,  0.0246]],

        [[-0.9702,  3.6488,  5.2317,  ..., -5.5706,  2.4317,  4.3568],
         [-5.4813, -0.7727,  7.8615,  ..., -5