In [28]:
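# torch and nn are used below; import them explicitly rather than relying on the star imports
import torch
import torch.nn as nn

from vocab import *
from model_embeddings import *
from utils import *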

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## `Vocab`

Let's look at the `Vocab` class. It is a standard class for encoding a corpus. Its `src` and `tgt` entries (of type `VocabEntry`) provide familiar methods such as `words2indices`, which encodes sentences as lists of integers using the words' indices in the vocabulary.

It also overrides special methods such as `__len__()` and `__getitem__()`, so we can call `len` on an entry and use `[]` to look up a word's index (see below).
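To make the role of these special methods concrete, here is a minimal sketch of how such an entry could be written. `TinyVocabEntry` is a hypothetical stand-in, not the actual `VocabEntry` from `vocab.py`; the reserved token ids mirror the ones seen in the outputs of this notebook, and the fallback to `<unk>` for unknown words is an assumption.

```python
class TinyVocabEntry:
    """Hypothetical, simplified stand-in for VocabEntry (not the real class)."""

    def __init__(self, words):
        # Reserved tokens first, using the same ids as in this notebook.
        self.word2id = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3}
        for w in words:
            if w not in self.word2id:
                self.word2id[w] = len(self.word2id)
        self.unk_id = self.word2id['<unk>']
        self.id2word = {i: w for w, i in self.word2id.items()}

    def __len__(self):
        # Enables len(entry).
        return len(self.word2id)

    def __getitem__(self, word):
        # Enables entry[word]; unknown words fall back to <unk> (assumed behavior).
        return self.word2id.get(word, self.unk_id)

    def words2indices(self, sents):
        # Encode a list of tokenized sentences as lists of integer ids.
        return [[self[w] for w in sent] for sent in sents]


entry = TinyVocabEntry(['de', 'que', 'la'])
size = len(entry)
pad_id = entry['<pad>']
encoded = entry.words2indices([['la', 'casa'], ['que']])
```

Here `'casa'` is not in the toy vocabulary, so it is encoded with the `<unk>` id.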

In [3]:
file_path = 'vocab.json'
vocab = Vocab.load(file_path)

In [6]:
type(vocab), type(vocab.src), type(vocab.tgt)

(vocab.Vocab, vocab.VocabEntry, vocab.VocabEntry)

In [10]:
len(vocab.src.id2word), len(vocab.tgt.id2word)

(50004, 50002)

In [16]:
# we may call len directly on the VocabEntry objects
len(vocab.src), len(vocab.tgt)

(50004, 50002)

In [12]:
[vocab.src.id2word[i] for i in range(10)]

['<pad>', '<s>', '</s>', '<unk>', 'de', 'que', 'la', 'en', 'y', 'el']

In [13]:
[vocab.tgt.id2word[i] for i in range(10)]

['<pad>', '<s>', '</s>', '<unk>', 'the', 'to', 'of', 'a', 'and', 'that']

In [17]:
vocab.src['<pad>']

0

## `ModelEmbeddings`

So here's the question: why should we specify `padding_idx` in our `Embedding` layer? The docs say: *If given, pads the output with the embedding vector at padding_idx (initialized to zeros) whenever it encounters the index*. What does this mean?

Let's create two layers, one with and one without the `padding_idx` parameter. In the first case (but not in the second), the embedding for `padding_idx` is indeed a vector of all zeros.
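Besides the zero initialization, `padding_idx` also matters during training: the row at that index receives no gradient, so it stays at zero unless it is modified by hand. The following is a small self-contained sketch with a toy vocabulary of 5 entries (the sizes and the batch are made up for illustration and are unrelated to `vocab.json`).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy layer: 5 "words", 3-dimensional embeddings, index 0 reserved for padding.
emb = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=0)

# Two toy "sentences", padded with index 0.
batch = torch.tensor([[1, 2, 0, 0],
                      [3, 4, 1, 0]])

out = emb(batch)        # shape: (2, 4, 3)
out.sum().backward()    # dummy loss, just to produce gradients

pad_row_is_zero = bool(torch.all(emb.weight[0] == 0))
pad_grad_is_zero = bool(torch.all(emb.weight.grad[0] == 0))
other_grad_nonzero = bool(torch.any(emb.weight.grad[1] != 0))
```

All three flags should come out `True`: the padding row starts at zero and gets a zero gradient, while rows for real tokens are updated as usual.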

In [32]:
embed_dim = 2
torch.manual_seed(42)
embed_with_pad = nn.Embedding(num_embeddings=len(vocab.src),
                              embedding_dim=embed_dim,
                              padding_idx=vocab.src['<pad>'])
embed_no_pad = nn.Embedding(num_embeddings=len(vocab.src),
                            embedding_dim=embed_dim)

In [22]:
sents = ["it's a true story -- every bit of this is true .".split(), 
         "driving ourselves .".split(),
         "there's been kind of a series of epiphanies .".split()]

In [27]:
sents_int = vocab.src.words2indices(sents)
sents_int

[[3, 10, 3, 3, 65, 3, 48630, 4205, 3, 21832, 3, 1295],
 [3, 3, 1295],
 [3, 3, 3, 4205, 10, 13003, 4205, 3, 1295]]

In [29]:
sents_int_padded = pad_sents(sents=sents_int, 
                             pad_token=vocab.src['<pad>'])

In [30]:
sents_int_padded

[[3, 10, 3, 3, 65, 3, 48630, 4205, 3, 21832, 3, 1295],
 [3, 3, 1295, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [3, 3, 3, 4205, 10, 13003, 4205, 3, 1295, 0, 0, 0]]
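The padding step can be sketched as follows. `pad_sents_sketch` is a hypothetical re-implementation written for illustration; the real `pad_sents` lives in `utils.py` and may differ in its details. The idea is to right-pad every sentence with the pad id up to the length of the longest sentence in the batch.

```python
def pad_sents_sketch(sents, pad_token):
    """Right-pad each sentence (a list of ids) to the length of the longest one.

    Hypothetical illustration; not the actual utils.pad_sents.
    """
    max_len = max((len(s) for s in sents), default=0)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]


padded = pad_sents_sketch([[3, 10, 1295], [3], [4205, 10]], pad_token=0)
```

After padding, all rows have the same length, which is what allows the batch to be turned into a single `LongTensor`.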

In [35]:
embed_with_pad(torch.LongTensor(sents_int_padded))

tensor([[[-0.0431, -1.6047],
         [-0.7581,  1.0783],
         [-0.0431, -1.6047],
         [-0.0431, -1.6047],
         [-1.4364, -1.1299],
         [-0.0431, -1.6047],
         [ 0.3067,  0.5527],
         [ 0.0292,  1.8709],
         [-0.0431, -1.6047],
         [-1.8290,  0.3548],
         [-0.0431, -1.6047],
         [ 2.7125, -0.1935]],

        [[-0.0431, -1.6047],
         [-0.0431, -1.6047],
         [ 2.7125, -0.1935],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000],
         [ 0.0000,  0.0000]],

        [[-0.0431, -1.6047],
         [-0.0431, -1.6047],
         [-0.0431, -1.6047],
         [ 0.0292,  1.8709],
         [-0.7581,  1.0783],
         [ 0.3018,  1.1055],
         [ 0.0292,  1.8709],
         [-0.0431, -1.6047],
         [ 2.7125, -0.1935],
         [ 0.0000,  0.0000],
         [

In [36]:
embed_no_pad(torch.LongTensor(sents_int_padded))

tensor([[[ 0.5085, -1.6564],
         [ 1.8229, -0.3589],
         [ 0.5085, -1.6564],
         [ 0.5085, -1.6564],
         [ 0.7962, -3.4634],
         [ 0.5085, -1.6564],
         [-0.0915, -0.0545],
         [ 0.2374,  1.5812],
         [ 0.5085, -1.6564],
         [-1.2246, -0.0556],
         [ 0.5085, -1.6564],
         [ 0.0704,  1.0442]],

        [[ 0.5085, -1.6564],
         [ 0.5085, -1.6564],
         [ 0.0704,  1.0442],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530],
         [-0.1748,  0.4530]],

        [[ 0.5085, -1.6564],
         [ 0.5085, -1.6564],
         [ 0.5085, -1.6564],
         [ 0.2374,  1.5812],
         [ 1.8229, -0.3589],
         [ 1.6651,  0.6543],
         [ 0.2374,  1.5812],
         [ 0.5085, -1.6564],
         [ 0.0704,  1.0442],
         [-0.1748,  0.4530],
         [