In [1]:
from vocab import *
from model_embeddings import *
from nmt_model import *
from utils import *
from sanity_check import *
import numpy as np

%load_ext autoreload
%autoreload 2

# loading data

Let's load some data that we may use to check our model. We use data from sanity check tests. 

As a reminder, our `Vocab` class contains `src` and `tgt` vocabularies (of type `VocabEntry`) and methods to `build`, `save` and `load`. `VocabEntry` is a standard class that contains a `word2id` dictionary used to encode our input as a list of integers that we feed into an `Embedding` layer. In our test case we have tiny vocabularies of under `100` entries.

In [2]:
vocab = Vocab.load('./sanity_check_en_es_data/vocab_sanity_check.json') 

In [3]:
len(vocab.src), len(vocab.tgt)

(77, 85)

In [4]:
list(vocab.src.word2id.items())[:10]

[('<pad>', 0),
 ('<s>', 1),
 ('</s>', 2),
 ('<unk>', 3),
 ('de', 4),
 ('que', 5),
 ('el', 6),
 ('en', 7),
 ('la', 8),
 ('a', 9)]

In [5]:
list(vocab.tgt.word2id.items())[:10]

[('<pad>', 0),
 ('<s>', 1),
 ('</s>', 2),
 ('<unk>', 3),
 ('the', 4),
 ('of', 5),
 ('to', 6),
 ('that', 7),
 ('and', 8),
 ('in', 9)]

To create our `model` we need this `vocab` and a couple of arguments that are defined in `sanity_check.py`. We pass `EMBED_SIZE - 1` as the embedding size so that it differs from `HIDDEN_SIZE` and the two are easy to tell apart in tensor shapes.

In [6]:
BATCH_SIZE, EMBED_SIZE, HIDDEN_SIZE, DROPOUT_RATE

(5, 3, 3, 0.0)

In [7]:
model = NMT(
    embed_size=EMBED_SIZE-1,
    hidden_size=HIDDEN_SIZE,
    dropout_rate=DROPOUT_RATE,
    vocab=vocab)

In [8]:
model

NMT(
  (model_embeddings): ModelEmbeddings(
    (source): Embedding(77, 2, padding_idx=0)
    (target): Embedding(85, 2, padding_idx=0)
  )
  (encoder): LSTM(2, 3, bidirectional=True)
  (decoder): LSTMCell(5, 3)
  (h_projection): Linear(in_features=6, out_features=3, bias=False)
  (c_projection): Linear(in_features=6, out_features=3, bias=False)
  (att_projection): Linear(in_features=6, out_features=3, bias=False)
  (combined_output_projection): Linear(in_features=9, out_features=3, bias=False)
  (target_vocab_projection): Linear(in_features=3, out_features=85, bias=False)
  (dropout): Dropout(p=0.0)
)

Now let's load the train data (both `source` and `target`). As we can see, this is just a list of tokenized sentences in Spanish and English before encoding.

Then we construct a batch of size `BATCH_SIZE` with the sentences *sorted* by length in decreasing order. The `pack_padded_sequence()` function expects this ordering.

The last step is to encode these sentences into lists of integers using our `word2id` dictionaries and pad them to a common length. The resulting tensor has shape `(seq_len, batch_size)`: we don't use the `batch_first` option of `LSTM`, so `seq_len` must be the first dimension.

In [9]:
train_data_src = read_corpus('./sanity_check_en_es_data/train_sanity_check.es', 'src')
train_data_tgt = read_corpus('./sanity_check_en_es_data/train_sanity_check.en', 'tgt')
train_data = list(zip(train_data_src, train_data_tgt))

In [10]:
type(train_data_src), type(train_data_tgt)

(list, list)

In [11]:
print(train_data_src[0], len(train_data_src[0]))

['Pero,', 'qu', 'puedes', 'hacer?', 'Ests', 'en', 'el', 'medio', 'del', 'ocano.'] 10


In [12]:
print(train_data_tgt[0])

['<s>', 'But', 'what', 'can', 'you', 'do?', "You're", 'in', 'the', 'middle', 'of', 'the', 'ocean.', '</s>']


In [13]:
it = batch_iter(train_data, batch_size=BATCH_SIZE, shuffle=False)

In [14]:
src_sents, tgt_sents = next(it)

In [15]:
# len of our batches is 5 (this is our batch_size)
len(src_sents), len(tgt_sents)

(5, 5)

In [16]:
[len(s) for s in src_sents], [len(s) for s in tgt_sents]

([22, 15, 10, 9, 7], [20, 21, 14, 14, 8])

In [17]:
print(src_sents[2])

['Pero,', 'qu', 'puedes', 'hacer?', 'Ests', 'en', 'el', 'medio', 'del', 'ocano.']


In [18]:
source_lengths = [len(s) for s in src_sents]
source_padded = model.vocab.src.to_input_tensor(src_sents, device=model.device)

In [19]:
# we may see that indeed 22 is the max len of source batch
source_padded.shape

torch.Size([22, 5])

In [20]:
# let's check that our encoding is correct
vocab.src.words2indices(src_sents)[2]

[3, 23, 3, 3, 3, 7, 6, 3, 14, 3]

In [21]:
source_padded.t()[2]

tensor([ 3, 23,  3,  3,  3,  7,  6,  3, 14,  3,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0])
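The encoding and padding shown above can be reproduced on a toy vocabulary. This is a minimal sketch, not the actual code of `to_input_tensor`: the helper `pad_and_encode` and the tiny `toy_word2id` dictionary are hypothetical names introduced only for illustration, and the special-token indices are assumed to match the ones printed earlier (`<pad>` is 0, `<unk>` is 3).

```python
import torch

def pad_and_encode(sents, word2id, pad_id=0, unk_id=3):
    """Hypothetical helper: map words to ids, pad to the longest sentence,
    and return a tensor of shape (seq_len, batch_size)."""
    ids = [[word2id.get(w, unk_id) for w in s] for s in sents]
    max_len = max(len(s) for s in ids)
    padded = [s + [pad_id] * (max_len - len(s)) for s in ids]
    # rows are sentences here, so transpose to put seq_len first
    return torch.tensor(padded, dtype=torch.long).t()

# toy vocabulary (assumption: same special tokens as in the notebook)
toy_word2id = {'<pad>': 0, '<s>': 1, '</s>': 2, '<unk>': 3, 'el': 6, 'en': 7}
toy_sents = [['en', 'el', 'medio'], ['el']]  # already sorted by length

toy_padded = pad_and_encode(toy_sents, toy_word2id)
print(toy_padded.shape)   # (seq_len, batch_size)
print(toy_padded.t())     # one row per sentence, padded with 0
```

Unknown words such as `'medio'` map to the `<unk>` index, which is why the encoded sentence above contains so many `3`s.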

Now we have our source data and can go through `encode` forward pass.

In [22]:
target_padded = model.vocab.tgt.to_input_tensor(tgt_sents, device=model.device)

In [23]:
# so again we have 21 - max len in this batch
# of size 5
target_padded.shape

torch.Size([21, 5])

# `NMT` model

## `init`

What kind of layers do we need?

- `LSTM` layers for `encoder` and `decoder`: a full `biLSTM` for `encoder` and a one-directional `LSTMCell` for `decoder` (in the decoder we have to incorporate attention at every step, so we cannot process the whole sequence at once);
- 2 `Embedding` layers for `encoder` and `decoder` (during training we supply the gold target sequence to our `decoder`);
- a bunch of `Linear` layers with or without `bias`; their functions and sizes are pretty clear from the `pdf`;

There are a few questions:
    
- what is the `input_size` for `decoder`? We use the concatenated vector $\bar{y}_t = [y_t; o_{t-1}]$ where $o_{t-1}$ has size `hidden_size` (why is that? let's postpone this question until the attention section) and $y_t$ has size `embed_size`; so `input_size` is just the sum of these 2 sizes (`2 + 3 = 5` in the model printed above);
- how do we use the fact that $W \in \mathbb{R}^{h \times 2h}$? What should the first and second dimensions of $W$ be? `Linear` takes a row input vector and multiplies it by the transposed matrix, $xW^T$ (as stated in the documentation), so the size of $x$ must equal the first dimension of $W^T$, that is the second dimension of $W$. The weight of a `Linear` layer is stored with shape `(out_features, in_features)`, which for this $W$ is `(h, 2h)`. The easiest way to avoid confusion is to use named arguments: `in_features` is the size of the input and `out_features` is the size of the output (usually `hidden_size`);
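The shape convention of `Linear` is easy to check directly. The snippet below is a small sketch with toy sizes (`h = 3`, a batch of 5 row vectors); the variable names are only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h = 3

# projection from 2h down to h, like h_projection / c_projection / att_projection
proj = nn.Linear(in_features=2 * h, out_features=h, bias=False)
print(proj.weight.shape)  # stored as (out_features, in_features)

x_rows = torch.randn(5, 2 * h)         # a batch of row vectors of size 2h
y_layer = proj(x_rows)                 # what the layer computes
y_manual = x_rows @ proj.weight.t()    # the same thing written as x W^T
print(y_layer.shape)
```

So a matrix written as $W \in \mathbb{R}^{h \times 2h}$ in the `pdf` corresponds to `in_features=2*h, out_features=h`.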

## `encode()`

To encode source data we need 3 steps:

- step 1: get `embedding` from input tensor;
- step 2: forward pass of the `LSTM`;
- step 3: get `decoder` initial state (projections and concatenation);

###  step 1: get `embedding` from input tensor;

This step converts `source_padded` of shape `(seq_len, batch_size)` into a tensor of shape `(seq_len, batch_size, embed_size)`, where `embed_size` is `2` in our case.

In [24]:
source_padded.shape

torch.Size([22, 5])

In [25]:
X = model.model_embeddings.source(source_padded)

In [26]:
X.shape

torch.Size([22, 5, 2])

### step 2: forward pass of the `LSTM`

The forward pass of the `LSTM` has a technical complication, packing, which we don't discuss in detail here. What should the shape be after the forward pass? We know that an `LSTM` basically applies a few matrix multiplications, so the last dimension changes from `embed_size` to `hidden_size`. But we have a `biLSTM`, which concatenates the forward and backward outputs, so the size is multiplied by `2`.

In [27]:
X_packed = pack_padded_sequence(X, lengths=source_lengths)
enc_hiddens, (last_hidden, last_cell) = model.encoder(X_packed)
enc_hiddens, _ = pad_packed_sequence(enc_hiddens)

In [28]:
model.hidden_size

3

In [29]:
enc_hiddens.shape

torch.Size([22, 5, 6])
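To see what packing does, here is a self-contained sketch on toy data (sizes and names are illustrative, not taken from the model). It checks two facts that matter for the next steps: positions past the end of a sentence come back as zeros, and the returned last hidden state of the forward direction is the output at the last *real* token, not at the last padded position.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seq_len, batch, embed, hidden = 4, 3, 2, 3
lengths = [4, 2, 1]                       # sorted in decreasing order

X_toy = torch.randn(seq_len, batch, embed)
lstm = nn.LSTM(embed, hidden, bidirectional=True)

packed = pack_padded_sequence(X_toy, lengths=lengths)
out_packed, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(out_packed)

print(out.shape)   # (seq_len, batch, 2 * hidden)
print(h_n.shape)   # (2 directions, batch, hidden)

# sentence 1 has length 2: time steps 2 and 3 are padding and come back as zeros
print(out[2:, 1])

# forward direction: last state is the output at the last real token
print(torch.allclose(out[lengths[1] - 1, 1, :hidden], h_n[0, 1]))
# backward direction: last state is the output at the first token
print(torch.allclose(out[0, 1, hidden:], h_n[1, 1]))
```

Without packing, the forward `LSTM` would keep running over the `<pad>` embeddings, and `last_hidden` of the shorter sentences would depend on the amount of padding.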

We then need to put `batch_size` back in the first position. Why is that? In `step()` all tensors are in `batch first` order, which is what `torch.bmm` expects.

In [30]:
enc_hiddens = enc_hiddens.permute(1, 0, 2)

In [31]:
enc_hiddens.shape

torch.Size([5, 22, 6])

### step 3: `decoder` initial state

We need some concatenations and projections (just a matrix multiplication or applying a linear layer without a bias). Let's look only at `last_hidden_cat` - the second operation is the same. 

We need to concatenate 2 tensors from `biLSTM` to get a high-dimensional tensor (`2*h`) and then project it back to low-dimensional (`h`). We need some way to get from the bidirectional `encoder` to the one directional `decoder` (`eq. 1-2` in `pdf`).

In [32]:
# the first dimension is for 2 states of biLSTM
# we need to concatenate them
last_hidden.shape

torch.Size([2, 5, 3])

In [33]:
last_hidden[0, 0, :]

tensor([ 0.1965, -0.2512, -0.2357], grad_fn=<SliceBackward>)

In [34]:
last_hidden[1, 0, :]

tensor([-0.0804,  0.3910, -0.2768], grad_fn=<SliceBackward>)

In [35]:
last_hidden_cat = torch.cat((last_hidden[0, :, :], last_hidden[1, :, :]), dim=1)

In [36]:
last_hidden_cat.shape

torch.Size([5, 6])

In [37]:
last_hidden_cat[0, :]

tensor([ 0.1965, -0.2512, -0.2357, -0.0804,  0.3910, -0.2768],
       grad_fn=<SliceBackward>)

In [38]:
init_decoder_hidden = model.h_projection(last_hidden_cat)

In [39]:
# we decreased the size back to h
init_decoder_hidden.shape

torch.Size([5, 3])

In [40]:
last_cell_cat = torch.cat((last_cell[0, :, :], last_cell[1, :, :]), dim=1)
init_decoder_cell = model.c_projection(last_cell_cat)
dec_init_state = (init_decoder_hidden, init_decoder_cell)

In [41]:
!python3 sanity_check.py 1d

Running Sanity Check for Question 1d: Encode
--------------------------------------------------------------------------------
enc_hiddens Sanity Checks Passed!
dec_init_state[0] Sanity Checks Passed!
dec_init_state[1] Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1d: Encode!
--------------------------------------------------------------------------------


## `decode()`

So we have 4 steps:

- step 1: apply the attention projection layer;
- step 2: construct tensor `Y` of target sentences;
- step 3: iterate over the time dimension of `Y`;
- step 4: reshape `combined_outputs`;

### arguments

First of all what arguments do we have? 

- `enc_hiddens` - this is just the output of our `encode` method, so these are outputs for *all* timesteps (attention needs all of them); the only thing we did was to permute the first 2 dimensions, so the final shape is `(b, src_len, h*2)` (see explanation above);
- `enc_masks` - we only pass it on as an argument to `step()`;
- `dec_init_state` - again, we compute it in `encode`; it is a tuple `(init_decoder_hidden, init_decoder_cell)`;
- `target_padded` - this is our gold target data of shape `(seq_len, batch_size)`; again, we don't use the batch first approach in `decode`;

### initial operations

#### chop off the `<END>` token for max length sentences

In [42]:
target_padded.shape

torch.Size([21, 5])

In [43]:
# 2 is index for </s>
target_padded.t()[:, -5:]

tensor([[ 3,  3, 49,  2,  0],
        [37, 52, 27,  3,  2],
        [ 0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0]])

In [44]:
# we may see that indeed the 2nd row is of max len 21
# and now it ends with 3, not 2
# so we removed </s> from it
target_padded[:-1].t()[:, -5:]

tensor([[ 3,  3,  3, 49,  2],
        [ 3, 37, 52, 27,  3],
        [ 0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0],
        [ 0,  0,  0,  0,  0]])

In [45]:
target_padded = target_padded[:-1]
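Why drop the last time step? A sketch of the reasoning, assuming the usual teacher-forcing setup (this alignment is an assumption, it is not shown elsewhere in this notebook): at step `t` the decoder reads token `t` and is trained to predict token `t+1`, so the inputs are all tokens except the last one and the prediction targets are all tokens except the first one. The toy tensor below uses the same special indices as our vocabulary (1 for `<s>`, 2 for `</s>`, 0 for `<pad>`).

```python
import torch

# toy batch in (seq_len, batch_size) layout; column 0 is the longest sentence
toy_target = torch.tensor([[1, 1],
                           [5, 6],
                           [7, 2],
                           [2, 0]])

decoder_inputs = toy_target[:-1]   # what the decoder reads at each step
next_tokens = toy_target[1:]       # what it should predict at each step

print(decoder_inputs.t())
print(next_tokens.t())
```

For the longest sentence the input never contains `</s>`; shorter sentences still contain it, followed by padding.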

#### misc operations

In [46]:
# initialize the decoder state (hidden and cell)
dec_state = dec_init_state

# initialize previous combined output vector o_{t-1} as zero
batch_size = enc_hiddens.size(0)
o_prev = torch.zeros(batch_size, model.hidden_size, device=model.device)

# initialize a list we will use to collect the combined output o_t on each step
combined_outputs = []

### step 1: apply the attention projection layer

We have to deal with the difference between `encoder` and `decoder`, so we have to reduce dimensionality of `enc_hiddens` (`2*h`). We need this to get our multiplicative attention (`eq. 7` in `pdf`): 

$$e_{ti} = (h^{dec}_t)^T W_{attProj} h^{enc}_i$$

In [47]:
enc_hiddens.shape

torch.Size([5, 22, 6])

In [48]:
enc_hiddens_proj = model.att_projection(enc_hiddens)

In [49]:
# last dimension is reduced from 6 to 3
enc_hiddens_proj.shape

torch.Size([5, 22, 3])

### step 2: construct tensor `Y` of target sentences

We just embed the input to our `decoder`: for shape `(seq_len, batch_size)` we get `(seq_len, batch_size, embed_size)`.

In [50]:
# the last time step was removed above (20 instead of 21)
# and batch_size is the 2nd dimension
target_padded.shape

torch.Size([20, 5])

In [51]:
Y = model.model_embeddings.target(target_padded)

In [52]:
# embedding size is the same for encoder and for decoder
model.model_embeddings.embed_size

2

In [53]:
Y.shape

torch.Size([20, 5, 2])

### step 3: iterate over the time dimension of `Y`

This is the main loop of our decoder, as described in `eq. 5` of our `pdf`:

$$h^{dec}_t, c^{dec}_t = decoder(\bar{y}_t, h^{dec}_{t-1}, c^{dec}_{t-1})$$

One more time: `split()` just turns a `(tgt_len, b, e)` tensor into a tuple of `tgt_len` tensors of shape `(1, b, e)`. We use `LSTMCell`, not `LSTM`, so we have to operate on one time step at a time (not on a whole sequence), but we're still able to process a batch.

In [54]:
x = torch.Tensor(np.arange(10*3*2).reshape(10, 3, 2))

In [55]:
x_split = torch.split(x, 1, dim=0)

In [56]:
type(x_split), len(x_split)

(tuple, 10)

In [57]:
x_split[0].shape

torch.Size([1, 3, 2])

In [58]:
x_split[0].squeeze().shape

torch.Size([3, 2])

All the magic happens in the `step()` function, which is basically the `forward` method of the `decoder` for one time step. The only additional argument we have to supply is $\bar{y}_t$. We construct it as follows.

In [59]:
Y_0 = torch.split(Y, 1, dim=0)[0]

In [60]:
Y.shape, Y_0.shape

(torch.Size([20, 5, 2]), torch.Size([1, 5, 2]))

In [61]:
# squeeze only the time dimension (a bare squeeze() would also drop batch_size == 1)
Y_0 = Y_0.squeeze(dim=0)

In [62]:
Y_0.shape

torch.Size([5, 2])

In [63]:
o_prev.shape

torch.Size([5, 3])

In [64]:
Ybar_0 = torch.cat((Y_0, o_prev), dim=1)

In [65]:
Ybar_0.shape

torch.Size([5, 5])

In [66]:
Y_0[0, :], o_prev[0, :]

(tensor([ 0.2890, -0.7240], grad_fn=<SliceBackward>), tensor([0., 0., 0.]))

In [67]:
Ybar_0[0, :]

tensor([ 0.2890, -0.7240,  0.0000,  0.0000,  0.0000], grad_fn=<SliceBackward>)

### step 4: stack `combined_outputs` from a list to a tensor

We need to stack the list of `o_t` tensors (each of shape `(batch_size, hidden_size)`) to get the combined output of our `decoder`. We need `seq_len` to be the first dimension, so we use `torch.stack(combined_outputs, dim=0)`. In this case we don't need `batch first` order.

In [68]:
# seq_len is 10 in our case
np.random.seed(42)
x = [torch.Tensor(np.random.randn(2, 3)) for _ in range(10)]

In [69]:
z = torch.stack(x, dim=0)

In [70]:
# so seq_len is in fact the first dimension
z.shape

torch.Size([10, 2, 3])

In [71]:
z = torch.stack(x, dim=1)

In [72]:
z.shape

torch.Size([2, 10, 3])
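Steps 3 and 4 together form a short loop. The sketch below is self-contained and runs on toy data; `toy_step` is a hypothetical stand-in for `step()` that skips attention entirely (it just returns `tanh` of the hidden state as the combined output), so only the bookkeeping around it should be read as relevant.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
tgt_len, b, e, h = 4, 5, 2, 3
Y_toy = torch.randn(tgt_len, b, e)     # embedded target sentences
cell = nn.LSTMCell(e + h, h)           # input is [y_t; o_{t-1}]

def toy_step(Ybar_t, dec_state):
    """Hypothetical stand-in for step(): no attention, o_t = tanh(h_t)."""
    dec_state = cell(Ybar_t, dec_state)
    o_t = torch.tanh(dec_state[0])
    return dec_state, o_t

dec_state = (torch.zeros(b, h), torch.zeros(b, h))
o_prev = torch.zeros(b, h)
combined = []

for Y_t in torch.split(Y_toy, 1, dim=0):          # each Y_t is (1, b, e)
    Y_t = Y_t.squeeze(dim=0)                      # (b, e)
    Ybar_t = torch.cat((Y_t, o_prev), dim=1)      # (b, e + h)
    dec_state, o_t = toy_step(Ybar_t, dec_state)
    combined.append(o_t)
    o_prev = o_t                                  # feed o_t into the next step

combined = torch.stack(combined, dim=0)           # (tgt_len, b, h)
print(combined.shape)
```

Note `squeeze(dim=0)` rather than a bare `squeeze()`: with a batch of size 1 the bare call would also remove the batch dimension.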

In [73]:
!python3 sanity_check.py 1e

--------------------------------------------------------------------------------
Running Sanity Check for Question 1e: Decode
--------------------------------------------------------------------------------
combined_outputs Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1e: Decode!
--------------------------------------------------------------------------------


## `step()`

So finally we have to write a function that computes attention. First of all let's understand how it should be computed:

- we compute attention scores $e_{ti} = (h^{dec}_t)^T W_{attProj} h^{enc}_i$; so we need: (a) the current decoder hidden state $h^{dec}_t$, which we compute with the `decoder` `forward` pass from `dec_state` and `Ybar_t`; (b) `enc_hiddens_proj`, which holds $W_{attProj} h^{enc}_i$ for every $i$ and is passed in as an argument;
- we compute attention weights $\alpha_t = softmax(e_t)$;
- now we may compute the attention output $a_t = \sum_i \alpha_{ti} h^{enc}_i$; for this we need the hidden states of the encoder, `enc_hiddens`;

Let's look at arguments of `step()`:

- `Ybar_t` - we compute it in `decode` as the concatenation of `Y_t` and `o_prev`;
- `dec_state` - the decoder state (hidden and cell) from the previous step; for the first step it is `dec_init_state` from `encode()`;
- `enc_hiddens` - that's a result of `encode()`;
- `enc_hiddens_proj` - we compute this in `decode` by applying `Linear` layer;

### first part

Let's look at `torch.bmm`.

In [74]:
input = torch.randn(10, 3, 4)
mat2 = torch.randn(10, 4, 5)
res = torch.bmm(input, mat2)

In [75]:
res.shape

torch.Size([10, 3, 5])

In [76]:
input.bmm(mat2).shape

torch.Size([10, 3, 5])

Let's do the forward pass. 

Then we have to multiply `dec_hidden` of shape `(batch_size, hidden_size)` and `enc_hiddens_proj` of shape `(batch_size, seq_len, hidden_size)`. What is the shape of `e_t`? It should be `(batch_size, seq_len)`. In other words we multiply matrix of size `(seq_len, hidden_size)` and vector of size `hidden_size` for each row in a batch.

How do we get this result using `torch.bmm`? First of all we have to make `dec_hidden` a `3D` tensor using `unsqueeze()`. We then multiply `enc_hiddens_proj` and `dec_hidden` in this order. Finally we should remove last dimension to get `2D` tensor.

In [77]:
dec_state = model.decoder(Ybar_0, dec_state)

In [78]:
dec_hidden, dec_cell = dec_state

In [79]:
dec_hidden.shape

torch.Size([5, 3])

In [80]:
enc_hiddens_proj.shape

torch.Size([5, 22, 3])

In [81]:
dec_hidden.unsqueeze(dim=2).shape

torch.Size([5, 3, 1])

In [93]:
e_0 = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(dim=2))

In [94]:
e_0.shape

torch.Size([5, 22, 1])

In [95]:
e_0 = e_0.squeeze(dim=2)

In [96]:
e_0.shape

torch.Size([5, 22])
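To convince ourselves that the `bmm` call really computes one matrix-vector product per sentence, we can compare it with an explicit loop. This is a sketch on random toy tensors with the same shapes as above; the `einsum` line is just an equivalent way of writing the same contraction.

```python
import torch

torch.manual_seed(0)
b, src_len, h = 5, 22, 3
proj_toy = torch.randn(b, src_len, h)   # plays the role of enc_hiddens_proj
hid_toy = torch.randn(b, h)             # plays the role of dec_hidden

# batched version: (b, src_len, h) x (b, h, 1) -> (b, src_len, 1) -> (b, src_len)
scores = torch.bmm(proj_toy, hid_toy.unsqueeze(dim=2)).squeeze(dim=2)

# reference: one (src_len, h) @ (h,) product per sentence
reference = torch.stack([proj_toy[k] @ hid_toy[k] for k in range(b)], dim=0)

# the same contraction with named indices
scores_einsum = torch.einsum('bsh,bh->bs', proj_toy, hid_toy)

print(scores.shape)
print(torch.allclose(scores, reference, atol=1e-5))
```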

### second part

We now need to apply `softmax` to `e_t`. Which dimension should we use? We need weights to combine the `encoder` hidden states, $a_t = \sum_i \alpha_{ti} h^{enc}_i$, so the weights must sum to 1 over the source positions. `e_t` has shape `(batch_size, seq_len)`, so we apply `softmax` along `dim=1`, i.e. separately for each row in a batch.

In [97]:
alpha_0 = F.softmax(e_0, dim=1)

In [98]:
alpha_0.shape

torch.Size([5, 22])

In [99]:
# so we in fact have weights that sum to 1
torch.sum(alpha_0[0, :])

tensor(1.0000, grad_fn=<SumBackward0>)
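One detail is skipped in the cells above: the `enc_masks` argument of `step()`. A plausible use, and this is an assumption rather than something shown in this notebook, is to mark the padded source positions so that they receive zero attention weight. The sketch below builds such a mask from toy lengths and sets the masked scores to `-inf` before the `softmax`.

```python
import torch
import torch.nn.functional as F

toy_lengths = [4, 2]                      # real lengths of 2 source sentences
src_len = max(toy_lengths)

# toy attention scores of shape (batch_size, src_len)
e_toy = torch.tensor([[0.5, 1.0, -0.3, 0.2],
                      [0.1, 0.4, 2.0, 3.0]])

# mask is True at padded positions (assumed convention)
toy_mask = torch.zeros(len(toy_lengths), src_len, dtype=torch.bool)
for row, n in enumerate(toy_lengths):
    toy_mask[row, n:] = True

e_masked = e_toy.masked_fill(toy_mask, float('-inf'))
alpha_toy = F.softmax(e_masked, dim=1)

print(alpha_toy)
print(alpha_toy.sum(dim=1))
```

In the second row the two padded positions had the largest raw scores, but after masking all of the weight goes to the two real tokens.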

Next we need to get the attention output vector using the formula above, so we use `torch.bmm` yet again: we multiply `alpha_t` of shape `(b, seq_len)` and `enc_hiddens` of shape `(b, seq_len, 2h)` to get `(b, 2h)`. As before, `bmm` needs `3D` tensors, so we first `unsqueeze` `alpha_t` to `(b, 1, seq_len)` and `squeeze` the result afterwards.

In [89]:
enc_hiddens.shape

torch.Size([5, 22, 6])

In [100]:
alpha_0.unsqueeze(dim=1).shape

torch.Size([5, 1, 22])

In [101]:
torch.bmm(alpha_0.unsqueeze(dim=1), enc_hiddens).shape

torch.Size([5, 1, 6])

In [104]:
a_0 = torch.bmm(alpha_0.unsqueeze(dim=1), enc_hiddens).squeeze(dim=1)

In [105]:
a_0.shape

torch.Size([5, 6])

Finally we have to compute combined output $o_t$:

$$u_t = [a_t; h^{dec}_t]$$
$$v_t = W_u u_t$$
$$o_t = dropout(tanh(v_t))$$

In [106]:
a_0.shape, dec_hidden.shape

(torch.Size([5, 6]), torch.Size([5, 3]))

In [107]:
U_0 = torch.cat((a_0, dec_hidden), dim=1)

In [108]:
a_0[0, :], dec_hidden[0, :]

(tensor([ 0.1954, -0.1677, -0.1786, -0.0671,  0.3494, -0.2230],
        grad_fn=<SliceBackward>),
 tensor([-0.0669, -0.0143,  0.0723], grad_fn=<SliceBackward>))

In [109]:
U_0[0, :]

tensor([ 0.1954, -0.1677, -0.1786, -0.0671,  0.3494, -0.2230, -0.0669, -0.0143,
         0.0723], grad_fn=<SliceBackward>)

Everything else is straightforward.

In [110]:
V_0 = model.combined_output_projection(U_0)

In [112]:
O_0 = model.dropout(torch.tanh(V_0))

In [113]:
O_0.shape

torch.Size([5, 3])
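The cells of this section can be collected into one function. This is a minimal sketch and not necessarily identical to the `step()` method in `nmt_model.py`: the function name `attention_step` and its explicit layer arguments are hypothetical (the real method reads the layers from `self`), and masking of padded positions is left out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_step(Ybar_t, dec_state, enc_hiddens, enc_hiddens_proj,
                   decoder, combined_proj, dropout):
    """One decoder step with multiplicative attention (sketch, no masking)."""
    dec_state = decoder(Ybar_t, dec_state)                 # LSTMCell forward
    dec_hidden, _ = dec_state                              # (b, h)
    # attention scores e_t: (b, src_len)
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)
    alpha_t = F.softmax(e_t, dim=1)                        # weights over source
    # attention output a_t: (b, 2h)
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)
    U_t = torch.cat((a_t, dec_hidden), dim=1)              # (b, 3h)
    O_t = dropout(torch.tanh(combined_proj(U_t)))          # (b, h)
    return dec_state, O_t, e_t

# toy usage with the same sizes as in this notebook
torch.manual_seed(0)
b, src_len, e, h = 5, 22, 2, 3
toy_enc = torch.randn(b, src_len, 2 * h)
toy_enc_proj = nn.Linear(2 * h, h, bias=False)(toy_enc)
toy_state = (torch.zeros(b, h), torch.zeros(b, h))
toy_Ybar = torch.randn(b, e + h)

new_state, O_t, e_t = attention_step(
    toy_Ybar, toy_state, toy_enc, toy_enc_proj,
    decoder=nn.LSTMCell(e + h, h),
    combined_proj=nn.Linear(3 * h, h, bias=False),
    dropout=nn.Dropout(p=0.0))

print(O_t.shape, e_t.shape)
```

The size of $o_t$ is `hidden_size` because `combined_output_projection` maps `3h` to `h`, which answers the question postponed in the `init` section.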

In [115]:
!python3 sanity_check.py 1f

--------------------------------------------------------------------------------
Running Sanity Check for Question 1f: Step
--------------------------------------------------------------------------------
dec_state[0] Sanity Checks Passed!
dec_state[1] Sanity Checks Passed!
combined_output  Sanity Checks Passed!
e_t Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1f: Step!
--------------------------------------------------------------------------------


And this completes debugging of our `nmt_model`!