In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

from vocab import *
from utils import *
from model_embeddings import *
from nmt_model import *

%load_ext autoreload
%autoreload 2

It's not easy to debug the model: there are no tests or shape hints, and it uses `conv1D` (new for this course) and brand-new `highway` layers. First of all, what are we trying to achieve? We don't use word embeddings in the usual way. Instead we build them in the following steps (see `figure 2` in `pdf`):

- we encode the chars of each word using the char vocabulary (this is already done);
- we use an `Embedding` layer for chars (not for words);
- we then apply `conv1D` and `highway` layers (built from scratch in separate files); all of this happens in the `ModelEmbeddings` class;

## train data

There are 2 issues with A5 (partially mentioned above):

- we don't have shape hints and have to figure the shapes out ourselves; I find this easier with actual data than in theory; the same data can also be used to create the (missing) tests;
- we also have to build all classes in reverse order, which doesn't help either; instead we can start by creating the data, then encode them, turn them into char embeddings and so on;

### raw data

First of all let's get some raw data. 

In [2]:
!head -3 './en_es_data/train_tiny.es'

Muchas gracias Chris. Y es en verdad un gran honor tener la oportunidad de venir a este escenario por segunda vez. Estoy extremadamente agradecido.
He quedado conmovido por esta conferencia, y deseo agradecer a todos ustedes sus amables comentarios acerca de lo que tenía que decir la otra noche.
Y digo eso sinceramente, en parte porque -- (Sollozos fingidos) -- lo necesito! ¡Pónganse en mi posición!


In [3]:
train_data_src = read_corpus('./en_es_data/train_tiny.es', source='src')
train_data_tgt = read_corpus('./en_es_data/train_tiny.en', source='tgt')
train_data = list(zip(train_data_src, train_data_tgt))

In [4]:
train_batch_size = 3

In [5]:
it = batch_iter(train_data, batch_size=train_batch_size, shuffle=False)

In [6]:
src_sents, tgt_sents = next(it)

In [7]:
[len(src_sents[i]) for i in range(len(src_sents))]

[25, 24, 18]

In [8]:
type(src_sents[0]), len(src_sents[0])

(list, 25)

In [9]:
print([src_sents[i][:5] for i in range(3)])

[['He', 'quedado', 'conmovido', 'por', 'esta'], ['Muchas', 'gracias', 'Chris.', 'Y', 'es'], ['Y', 'digo', 'eso', 'sinceramente,', 'en']]


### encoded data

In [10]:
vocab = Vocab.load('vocab_tiny_q1.json')

In [11]:
len(vocab.src), len(vocab.tgt)

(132, 132)

In [12]:
# our src vocab_entry contains the same 
# predefined char_list
len(vocab.src.char_list)

92

In [13]:
source_padded_chars = vocab.src.to_input_tensor_char(src_sents, device=torch.device('cpu')) 

In [14]:
source_padded_chars.shape

torch.Size([25, 3, 21])

In [15]:
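# permute (not view) so that the batch axis really comes first: (batch_size, max_sent_len, max_word_len)
source_padded_chars_resh = source_padded_chars.permute(1, 0, 2).contiguous()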

In [16]:
source_padded_chars_resh[0, 0, :]

tensor([ 1, 11, 34,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0])

This is the encoding of `He` (together with the `{` and `}` markers and padding).

In [17]:
[vocab.src.id2char[i] for i in source_padded_chars_resh[0, 0, :5].numpy()]

['{', 'H', 'e', '}', '<pad>']

We know this shape from previous analysis. Here `21` is our predefined `max_word_len` (chars per word), `25` is the length in words of the longest sentence in our batch (see above) and `3` is `batch_size`. But why do we need this shape? It looks like `source_padded_chars` is the input to our char `Embedding` layer. Let's try to figure it out.
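To make the `(max_sentence_length, batch_size, max_word_len)` layout concrete, here is a minimal pure-Python sketch of how already-encoded words could be padded into it. The helper name `pad_char_ids` is hypothetical, and the assignment's own padding code may differ in details.

```python
from typing import List

PAD_ID = 0            # id of '<pad>' in the char vocabulary
MAX_WORD_LENGTH = 21  # predefined maximum number of chars per word


def pad_char_ids(sents: List[List[List[int]]],
                 max_word_length: int = MAX_WORD_LENGTH,
                 pad_id: int = PAD_ID) -> List[List[List[int]]]:
    """Pad char ids; result is laid out as
    (max_sentence_length, batch_size, max_word_length)."""
    max_sent_len = max(len(sent) for sent in sents)
    pad_word = [pad_id] * max_word_length
    padded = []
    for sent in sents:
        # Pad (or truncate) every word to max_word_length chars.
        words = [(word + pad_word)[:max_word_length] for word in sent]
        # Pad short sentences with all-<pad> words.
        words += [pad_word] * (max_sent_len - len(words))
        padded.append(words)
    # Put the word position first: padded[b][t] becomes out[t][b].
    return [[padded[b][t] for b in range(len(sents))]
            for t in range(max_sent_len)]


# Two toy sentences: one with two words, one with a single word.
batch = [[[1, 11, 34, 2], [1, 30, 2]], [[1, 28, 2]]]
out = pad_char_ids(batch)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 21
print(out[0][0][:5])                          # [1, 11, 34, 2, 0]
```

Indexing `out[t][b]` gives word `t` of sentence `b`, which is why the word position is the first axis of the tensor.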

## `ModelEmbeddings`

### char `Embedding` 

The first layer in `ModelEmbeddings` is an `Embedding` layer. To create it we need 2 parameters:

- `num_embeddings` - the size of the vocabulary, in our case the char vocabulary, i.e. `96` (see below);
- `embedding_dim` - predefined and equal to `50` (lower than the `256` used in assignment `4`);

We actually have a predefined char vocabulary:

In [18]:
vocab_entry = VocabEntry()

In [19]:
vocab_entry.char_list[:5], len(vocab_entry.char_list)

(['A', 'B', 'C', 'D', 'E'], 92)

In [20]:
# so we have 4 additional symbols in addition to char_list
print(list(vocab_entry.char2id.items())[:5], len(vocab_entry.char2id))

[('<pad>', 0), ('{', 1), ('}', 2), ('<unk>', 3), ('A', 4)] 96
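The mapping above can be rebuilt in a few lines. This is only a sketch: it assumes that `char_list` starts with the upper-case and then the lower-case ASCII letters (which matches the ids seen for `He`) and leaves out the remaining characters of the real 92-entry list; `encode_word` is a hypothetical helper.

```python
import string

# Assumption: only the letter prefix of char_list is reproduced here;
# the real list has 92 entries.
char_list = list(string.ascii_uppercase + string.ascii_lowercase)

# Four special symbols come first, then the chars in list order.
char2id = {'<pad>': 0, '{': 1, '}': 2, '<unk>': 3}
for ch in char_list:
    char2id[ch] = len(char2id)   # 'A' -> 4, 'B' -> 5, ...


def encode_word(word):
    """Wrap a word in start/end markers and map each char to its id."""
    unk = char2id['<unk>']
    return [char2id['{']] + [char2id.get(ch, unk) for ch in word] + [char2id['}']]


print(len(char2id))        # 56 here; 4 + 92 = 96 with the full list
print(encode_word('He'))   # [1, 11, 34, 2]
```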


What should the input to this layer be? It looks like `Embedding` just adds one dimension. From the docs:

- Output: `(*, H)`, where `*` is the input shape and `H` is `embedding_dim` (here `char_embedding_dim`).

In [21]:
torch.manual_seed(42)
embed = nn.Embedding(num_embeddings=len(vocab.src.char2id),
                     embedding_dim=50,
                     padding_idx=vocab.src.char2id['<pad>'])

In [22]:
x_embed = embed(source_padded_chars)

In [23]:
x_embed.shape

torch.Size([25, 3, 21, 50])

So this is in fact the input shape plus a trailing `char_embedding_dim` dimension. Let's check that the first entry is our word `He`.

In [24]:
x_embed[0, 0, :4, :5]

tensor([[ 0.0780,  0.5258, -0.4880,  1.1914, -0.8140],
        [-0.1678,  1.6433,  2.0071, -1.2531,  1.1189],
        [-0.6106,  1.0629,  1.2222,  0.7719, -1.2797],
        [ 0.6408,  0.5832,  1.0669, -0.4502, -0.1853]],
       grad_fn=<SliceBackward>)

In [25]:
W = embed.weight

In [26]:
W.shape

torch.Size([96, 50])

In [27]:
source_padded_chars_resh[0, 0, :4]

tensor([ 1, 11, 34,  2])

In [28]:
W[source_padded_chars_resh[0, 0, :4].numpy(), :5]

tensor([[ 0.0780,  0.5258, -0.4880,  1.1914, -0.8140],
        [-0.1678,  1.6433,  2.0071, -1.2531,  1.1189],
        [-0.6106,  1.0629,  1.2222,  0.7719, -1.2797],
        [ 0.6408,  0.5832,  1.0669, -0.4502, -0.1853]],
       grad_fn=<IndexBackward>)

But we still don't know why we use this shape for our input. Next stop: the `conv1D` layer.

### conv1D and max_pool

This is probably the most difficult place for us to understand all the shapes. Let's start.

### conv1D

What shape should the input to `conv1D` have? As specified in the docs: `(N, C, L)`, where `N` is the batch size, `C` is the number of (input) channels and `L` is the sequence length. Fortunately we have a hint about what this means in our case. The `pdf` specifies that the input to `conv1D` should have shape $(e_{char}, m_{word})$, in our case `(50, 21)`. So we have to merge the first two dimensions and swap the last two.

This is consistent with what is specified in the book:

- `number of channels` is the length of the vector (in our case `char_embed_size`, `50`);
- `seq_len` is the length of the char sequence (in our case `max_word_len`, `21`);

If we're looking for an analogy with vision: characters are our pixels and instead of 3 channels `RGB` we have `embed_size` channels.

The input layout for the `conv1D` layer is also specified in the book: *`Conv1d` layers we will use require the data tensors to have the batch on the 0th dimension, channels on the 1st dimension, and sequence length on the 2nd.*
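Swapping the last two dimensions needs a real transpose: a plain `view` with the target shape yields the right shape but the wrong values. A small sketch with a toy tensor (sizes chosen only for illustration) shows the difference:

```python
import torch

# Tiny stand-in for x_embed: (sentence_len=1, batch=1, word_len=3, channels=2).
# Each row is one character's embedding vector: [0, 1], [2, 3], [4, 5].
x = torch.arange(6.0).view(1, 1, 3, 2)

wrong = x.view(-1, 2, 3)                   # only reinterprets memory
right = x.view(-1, 3, 2).transpose(1, 2)   # merges dims, then swaps axes

print(wrong.shape == right.shape)  # True: both are (1, 2, 3)
print(wrong[0])  # [[0, 1, 2], [3, 4, 5]]: rows mix different characters
print(right[0])  # [[0, 2, 4], [1, 3, 5]]: row c is channel c of every character
```

Since both results have the same shape, checking shapes alone cannot catch this mistake; comparing a few values is the safer test.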

In [29]:
x_embed.shape

torch.Size([25, 3, 21, 50])

In [35]:
_, _, seq_len, n_channels  = x_embed.shape

In [36]:
n_channels, seq_len

(50, 21)

In [37]:
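# merge the (sentence, batch) dims first, then swap chars and channels;
# a plain view(-1, n_channels, seq_len) would give the same shape but scrambled values
x_embed_resh = x_embed.view(-1, seq_len, n_channels).transpose(1, 2)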

In [38]:
x_embed_resh.shape

torch.Size([75, 50, 21])

Let's create the `conv1D` layer. We need 3 parameters:

- `in_channels` - this is `n_channels` in `x_embed` (the same as `char_embed_size`, `50`);
- `out_channels` (the same as `num_filters`) - this is an important parameter; let's see where it's used: the embedding is created before the data is fed to the `LSTM`, via `ModelEmbeddings(embed_size, vocab.src)`; that `embed_size` is actually the **`word_embedding_size`**, and it serves as `num_filters`, because the whole point of the convolution steps is to build the word embedding described above; we use `256` as `word_embedding_size` (as specified in `pdf`);
- `kernel_size` - predefined in `pdf` and equal to `5`;

In [40]:
torch.manual_seed(42)
char_embed_size = 50
num_filters = 256
kernel_size = 5
conv1D = nn.Conv1d(in_channels=char_embed_size,
                   out_channels=num_filters,
                   kernel_size=kernel_size,
                   bias=True)

In [41]:
x_conv1D = conv1D(x_embed_resh)

In [42]:
x_conv1D.shape

torch.Size([75, 256, 17])

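This shape is not surprising:

- the number of channels changed from `in_channels` to `out_channels`;
- to compute the output length after applying a filter we use the standard formula (from C4W1L05 cs230), where $n$ is the input length, $f$ the filter size, $p$ the padding and $s$ the stride: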

$$\frac{n + 2p - f}{s} + 1$$

In [44]:
# we don't use padding and
# use default stride = 1
(21 + 0 - 5) / 1 + 1

17.0
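The same arithmetic as a small helper, using integer (floor) division so that it also works when the stride does not divide evenly. The function name is hypothetical; the parameter names follow the formula above.

```python
def conv_output_length(n, f, p=0, s=1):
    """Output length for input length n, filter size f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1


print(conv_output_length(21, 5))        # 17: our case, no padding, stride 1
print(conv_output_length(21, 5, p=1))   # 19: one pad position on each side
print(conv_output_length(15, 3, s=3))   # 5: non-overlapping windows of size 3
```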

### max_pool

The next step is `max_pool`. Like `conv1D`, it works with 3D tensors, and it changes only the last dimension:

- Input: $(N, C, L_{in})$
- Output: $(N, C, L_{out})$

Here's the formula for this last dimension (`stride == k` by default):

$$L_{out} = \frac{L_{in} - k}{k} + 1$$

In [72]:
(15 - 3) / 3 + 1

5.0

In [65]:
m = nn.MaxPool1d(kernel_size=3)

In [66]:
torch.manual_seed(42)
x_in = torch.randn(2, 3, 15)

In [67]:
x_in.shape

torch.Size([2, 3, 15])

In [73]:
x_in[0, 0, :]

tensor([ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047,
        -0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688])

In [68]:
x_out = m(x_in)

In [69]:
x_out.shape

torch.Size([2, 3, 5])

In [74]:
x_out[0, 0, :]

tensor([ 1.9269,  0.6784, -0.0431,  1.6487, -0.5594])

Let's check that this is the correct result.

In [76]:
torch.Tensor([torch.max(x_in[0, 0, i:i+3]) for i in range(0, 15, 3)])

tensor([ 1.9269,  0.6784, -0.0431,  1.6487, -0.5594])

What should the `kernel_size` be in our case? This follows from the `pdf`. The output `x_conv` has shape `(batch_size, word_embed_size, max_word_len - k + 1)`, or `(75, 256, 17)` in our case, where the batch dimension holds all `25 * 3` words. After `max_pool` only the first 2 dimensions remain: `(batch_size, word_embed_size)`.

In terms of `pdf`:

$$x_{conv} \in \mathbb{R}^{e_{word} \times (m_{word} - k + 1)}$$
$$x_{conv-out} \in \mathbb{R}^{e_{word}}$$

Why is that? This is max pooling over time, the most popular kind of pooling here: the maximum is taken over the whole `seq_len` axis. What should the `kernel_size` be? To collapse the last dimension completely, the kernel size must equal its length, here `17`.

In [77]:
max_pool = nn.MaxPool1d(kernel_size=17)

In [79]:
x_conv_out = max_pool(F.relu(x_conv1D))

In [80]:
x_conv_out.shape

torch.Size([75, 256, 1])

In [81]:
x_conv_out = x_conv_out.squeeze(dim=2)

In [82]:
x_conv_out.shape

torch.Size([75, 256])
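An alternative that avoids hard-coding `17` is to take the maximum over the last axis directly. The sketch below uses a small random stand-in tensor (the sizes are arbitrary) and checks that both routes agree:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x_conv = torch.randn(4, 8, 17)   # stand-in for (N, num_filters, L_out)

# Route 1: max-pool with a kernel that spans the whole last axis, then squeeze.
pooled_a = F.max_pool1d(F.relu(x_conv), kernel_size=x_conv.size(2)).squeeze(dim=2)
# Route 2: reduce over the last axis directly.
pooled_b = torch.max(F.relu(x_conv), dim=2).values

print(pooled_a.shape)                    # torch.Size([4, 8])
print(torch.equal(pooled_a, pooled_b))   # True
```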

## highway network

The last step of part 1 is the `highway` network. It's much easier to understand than `conv1D`: the shape does not change.
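Written as formulas, the computation in the cells that follow is roughly the following; the weight and bias names $W_{proj}, b_{proj}, W_{gate}, b_{gate}$ are notation assumed here for the two `Linear` layers, and $\odot$ is the element-wise product:

$$x_{proj} = \mathrm{ReLU}(W_{proj}\, x_{conv-out} + b_{proj})$$
$$x_{gate} = \sigma(W_{gate}\, x_{conv-out} + b_{gate})$$
$$x_{highway} = x_{gate} \odot x_{proj} + (1 - x_{gate}) \odot x_{conv-out}$$

Every term lives in $\mathbb{R}^{e_{word}}$, which is why the shape stays `(75, 256)`.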

In [120]:
word_embed_size = 256

In [121]:
torch.manual_seed(42)
proj = nn.Linear(in_features=word_embed_size, out_features=word_embed_size, bias=True)
gate = nn.Linear(in_features=word_embed_size, out_features=word_embed_size, bias=True)

In [122]:
x_proj = F.relu(proj(x_conv_out))
x_gate = torch.sigmoid(gate(x_conv_out))
x_highway = x_gate * x_proj + (1 - x_gate) * x_conv_out

In [123]:
x_highway.shape

torch.Size([75, 256])

This concludes the analysis of part 1 of A5.
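As a recap, the steps above can be chained in one module. This is a sketch under assumptions, not the assignment's reference solution: the class name `CharCNNEmbedding` is hypothetical, and the final reshape back to `(max_sentence_length, batch_size, word_embed_size)` is an assumption about what the downstream `LSTM` expects.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharCNNEmbedding(nn.Module):
    """Hypothetical name: char ids -> word embeddings via CNN + highway."""

    def __init__(self, n_chars=96, char_embed_size=50, word_embed_size=256,
                 kernel_size=5, pad_idx=0):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_embed_size, padding_idx=pad_idx)
        self.conv = nn.Conv1d(char_embed_size, word_embed_size, kernel_size)
        self.proj = nn.Linear(word_embed_size, word_embed_size)
        self.gate = nn.Linear(word_embed_size, word_embed_size)

    def forward(self, char_ids):
        sent_len, batch_size, word_len = char_ids.shape
        x = self.embed(char_ids)                     # (S, B, W, C)
        x = x.view(-1, word_len, x.size(-1))         # (S*B, W, C)
        x = x.transpose(1, 2)                        # (S*B, C, W)
        x = F.relu(self.conv(x)).max(dim=2).values   # (S*B, E), max over time
        g = torch.sigmoid(self.gate(x))
        x = g * F.relu(self.proj(x)) + (1 - g) * x   # highway, (S*B, E)
        # Assumption: restore the (sentence, batch) axes for the next layer.
        return x.view(sent_len, batch_size, -1)      # (S, B, E)


model = CharCNNEmbedding()
char_ids = torch.randint(0, 96, (25, 3, 21))   # random ids, same shape as our batch
print(model(char_ids).shape)  # torch.Size([25, 3, 256])
```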

## tests

In [126]:
! python3 sanity_check.py 1h

--------------------------------------------------------------------------------
Running Sanity Check for Question 1h: Highway
--------------------------------------------------------------------------------
Sanity Check Passed for Question 1h: Highway!
--------------------------------------------------------------------------------


In [115]:
torch.save(x_conv_out, './sanity_check_en_es_data/1h_x_conv_out.pt')

In [124]:
torch.save(x_highway, './sanity_check_en_es_data/1h_x_highway.pt')