# 10 - Introduction to Machine Translation
Prepared by Jan Christian Blaise Cruz

DLSU Machine Learning Group

# Preliminaries

First, let's make sure that we have an active GPU.

In [None]:
!nvidia-smi

Thu Sep  3 11:58:07 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.66       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Then we'll download the Flickr Multi30k dataset, as well as some tokenizer models from spaCy. After you do this, make sure to **restart the runtime** by clicking Runtime > Restart Runtime in the menu bar.

In [None]:
!wget https://s3.us-east-2.amazonaws.com/blaisecruz.com/datasets/translation/multi30k.zip
!unzip multi30k.zip && rm multi30k.zip
!python -m spacy download de_core_news_sm
!python -m spacy download en_core_web_sm

Let's include our standard imports.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as datautils

import spacy
import numpy as np

import random
from collections import Counter
from tqdm import tqdm

np.random.seed(42)
torch.manual_seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Then load the dataset. There's no need to shuffle and split as the dataset already has predefined training and validation splits.

In [None]:
with open('multi30k/train.en', 'r') as f:
    train_en = [line.strip() for line in f]
with open('multi30k/train.de', 'r') as f:
    train_de = [line.strip() for line in f]
with open('multi30k/val.en', 'r') as f:
    valid_en = [line.strip() for line in f]
with open('multi30k/val.de', 'r') as f:
    valid_de = [line.strip() for line in f]

We'll tokenize our data. We'll add in a start of sequence and an end of sequence token. This will make it easier for the model later to learn when to stop generating tokens for the translations. We'll also make everything lowercase.

In [None]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    return ['<sos>'] + [tok.text.lower() for tok in spacy_de.tokenizer(text)] + ['<eos>']

def tokenize_en(text):
    return ['<sos>'] + [tok.text.lower() for tok in spacy_en.tokenizer(text)] + ['<eos>']

# Tokenize the text
train_en = [tokenize_en(text) for text in tqdm(train_en)]
train_de = [tokenize_de(text) for text in tqdm(train_de)]
valid_en = [tokenize_en(text) for text in tqdm(valid_en)]
valid_de = [tokenize_de(text) for text in tqdm(valid_de)]

100%|██████████| 29000/29000 [00:01<00:00, 15642.70it/s]
100%|██████████| 29000/29000 [00:02<00:00, 10186.42it/s]
100%|██████████| 1014/1014 [00:00<00:00, 20026.39it/s]
100%|██████████| 1014/1014 [00:00<00:00, 11979.41it/s]


Next up, we'll pad the samples in the dataset. Each split is padded to the length of its own longest sequence, so no sample gets cut.

In [None]:
def process(dataset):
    max_len = max([len(text) for text in dataset])
    temp = []
    for text in dataset:
        if len(text) < max_len:
            text += ['<pad>' for _ in range(max_len - len(text))]
        temp.append(text)
    return temp

# Pad to maximum length of the dataset
train_en_proc, valid_en_proc = process(train_en), process(valid_en)
train_de_proc, valid_de_proc = process(train_de), process(valid_de)
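
To make the padding step concrete, here is a minimal sketch on a toy dataset. The helper name `pad_to_longest` and the toy sentences are made up for this illustration; the helper builds new lists instead of touching its input.

```python
def pad_to_longest(dataset, pad_token='<pad>'):
    # Hypothetical helper: pad every sentence to the longest one in `dataset`
    max_len = max(len(text) for text in dataset)
    return [text + [pad_token] * (max_len - len(text)) for text in dataset]

# Toy tokenized sentences (not from Multi30k)
toy = [['<sos>', 'hello', '<eos>'],
       ['<sos>', 'good', 'morning', 'everyone', '<eos>']]
padded = pad_to_longest(toy)

for sentence in padded:
    print(len(sentence), sentence)
```

Every sentence ends up with the same length, which is what lets us stack a split into a single tensor later on.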

Here's the first training example in English.

In [None]:
print(train_en_proc[0])

['<sos>', 'two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


And its corresponding sentence in German.

In [None]:
print(train_de_proc[0])

['<sos>', 'zwei', 'junge', 'weiße', 'männer', 'sind', 'im', 'freien', 'in', 'der', 'nähe', 'vieler', 'büsche', '.', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']


Our vocabulary generation scheme remains largely the same, but with some minor improvements. We'll use ```collections.Counter``` to get token frequency counts and remove rare tokens. This ensures that the vocabulary and our embeddings don't become too sparse.

We'll generate two sets of vocabularies and index-word converters: one for English and one for German.

In [None]:
def get_vocab(dataset, min_freq=2):
    # Add all tokens to the list
    special_tokens = ['<unk>', '<pad>', '<sos>', '<eos>']
    vocab = []
    for line in dataset: vocab.extend(line)

    # Keep only words that occur more than min_freq times, then enforce a set
    counts = Counter(vocab)
    vocab = special_tokens + [word for word in counts.keys() if counts[word] > min_freq]
    vocab_set = set(vocab)

    # Push all special tokens to the front
    idx2word = list(vocab_set)
    for token in special_tokens[::-1]:
        idx2word.insert(0, idx2word.pop(idx2word.index(token)))

    # Produce word2idx then return
    word2idx = {idx2word[i]: i for i in range(len(idx2word))}
    return vocab_set, idx2word, word2idx

# Get vocabulary and references
vocab_set_en, idx2word_en, word2idx_en = get_vocab(train_en_proc, min_freq=2)
vocab_set_de, idx2word_de, word2idx_de = get_vocab(train_de_proc, min_freq=2)

# Convert unknown tokens
train_en_proc = [[token if token in vocab_set_en else '<unk>' for token in line] for line in train_en_proc]
train_de_proc = [[token if token in vocab_set_de else '<unk>' for token in line] for line in train_de_proc]
valid_en_proc = [[token if token in vocab_set_en else '<unk>' for token in line] for line in valid_en_proc]
valid_de_proc = [[token if token in vocab_set_de else '<unk>' for token in line] for line in valid_de_proc]

Here's the number of words in both vocabularies.

In [None]:
len(vocab_set_en), len(vocab_set_de)

(4556, 5376)
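
Here is a small sketch of the frequency filtering on a toy corpus, in case the effect of the threshold is not obvious. The sentences and the `min_freq = 1` setting are assumptions chosen for illustration; the strict `>` comparison mirrors the one in `get_vocab`.

```python
from collections import Counter

# Toy tokenized corpus (not from Multi30k)
toy = [['<sos>', 'a', 'dog', 'runs', '<eos>'],
       ['<sos>', 'a', 'dog', 'sits', '<eos>'],
       ['<sos>', 'a', 'cat', 'sits', '<eos>']]

counts = Counter(token for line in toy for token in line)
min_freq = 1

# A token must appear MORE than min_freq times to be kept
kept = sorted(word for word, n in counts.items() if n > min_freq)
print(kept)

# Everything else is mapped to <unk>
replaced = [[tok if tok in kept else '<unk>' for tok in line] for line in toy]
print(replaced[2])
```

Words that appear only once ("runs" and "cat") fall out of the vocabulary and get replaced by the unknown token.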

Next, we'll convert every token into its corresponding index in the vocabulary.

In [None]:
def serialize(dataset, word2idx):
    temp = []
    for line in dataset: temp.append([word2idx[token] for token in line])
    return torch.LongTensor(temp)

# Convert to idx
y_train = serialize(train_en_proc, word2idx_en)
X_train = serialize(train_de_proc, word2idx_de)
y_valid = serialize(valid_en_proc, word2idx_en)
X_valid = serialize(valid_de_proc, word2idx_de)

Then produce our dataloaders.

In [None]:
bs = 128

train_dataset = datautils.TensorDataset(X_train, y_train)
valid_dataset = datautils.TensorDataset(X_valid, y_valid)
train_sampler = datautils.RandomSampler(train_dataset)
train_loader = datautils.DataLoader(train_dataset, batch_size=bs, sampler=train_sampler)
valid_loader = datautils.DataLoader(valid_dataset, batch_size=bs, shuffle=False)

Now, since this is a text generation task and we'll be using the outputs of our RNNs, we'll have to use top-down sequentiality (like in language modeling). 

Left-right sequentiality is only useful for sequence classification tasks (like sentiment classification) as it treats one sentence as a batch. In text generation, we treat a number of tokens as a batch and the task is to generate the following tokens, which is why we use top-down.

To easily convert our left-right batches, we can use ```.rot90()```. Here's an example showing the first batch in the training set.

In [None]:
x, y = next(iter(train_loader))
x, y = x.rot90(k=3), y.rot90(k=3)

print(x.shape, y.shape)
print(x)
print(y)

torch.Size([46, 128]) torch.Size([43, 128])
tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [ 816, 2818,  883,  ...,  816,  883,  883],
        [1601, 3547, 5046,  ...,  490, 5046, 5046],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
tensor([[   2,    2,    2,  ...,    2,    2,    2],
        [1916, 1722, 1916,  ..., 1916, 1916, 1916],
        [ 685, 3055, 1576,  ..., 3304, 1576, 1576],
        ...,
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1],
        [   1,    1,    1,  ...,    1,    1,    1]])
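
To see what the rotation does to the layout, here is a minimal sketch on a tiny hand-made batch (the tensor values are made up). It suggests that ```rot90(k=3)``` behaves like a transpose followed by a flip along the batch axis, so each column becomes one sentence read top to bottom.

```python
import torch

# Toy batch: 2 sentences (rows) of 3 token ids each, i.e. shape [batch, seq]
batch = torch.tensor([[1, 2, 3],
                      [4, 5, 6]])

rotated = batch.rot90(k=3)
print(rotated.shape)  # sequence length first, batch second
print(rotated)

# Compare against an explicit transpose + flip along the batch axis
same = torch.equal(rotated, batch.t().flip(1))
print(same)
```

Note that the order of the sentences within the batch is reversed. Since we rotate both the source and target batches the same way, the sentence pairs still line up.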


# Modeling

First up, we have our encoder.

"Encoder" is common parlance in deep learning, especially in NLP. When we say encoder, we usually refer to a way to embed sequential input. In this case, it's an embedding layer + an RNN layer that we use to "encode" the source sentence into a feature representation.

When we pass our source sentence to the encoder, it gives us hidden and cell states that we use to inform the decoder about the source sentence.

The architecture here is simple and it's something that we've seen before already in language modeling, minus the projection layer. The one improvement we'll add here is the inclusion of ```pack_padded_sequence``` which allows our RNN to disregard padding tokens.

In [None]:
class Encoder(nn.Module):
    def __init__(self, vocab_sz, embedding_dim, hidden_dim, num_layers=1, dropout=0.5):
        super(Encoder, self).__init__()
        self.embedding = nn.Embedding(vocab_sz, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers)
        self.dropout = nn.Dropout(dropout)
        self.vocab_sz = vocab_sz

    def init_hidden(self, bs):
        weight = next(self.parameters())
        hidden_dim = self.rnn.hidden_size
        layers = self.rnn.num_layers

        h = weight.new_zeros(layers, bs, hidden_dim)
        c = weight.new_zeros(layers, bs, hidden_dim)
        return h, c

    def forward(self, x, pad_idx=None):
        msl, bs = x.shape
        out = self.embedding(x)
        out = self.dropout(out)
        hidden, cell = self.init_hidden(bs)

        if pad_idx is not None:
            # Count non-pad tokens per batch column so lengths line up with the columns of x
            lens = (x != pad_idx).sum(dim=0).cpu()
            out = nn.utils.rnn.pack_padded_sequence(out, lens, enforce_sorted=False)
        
        out, (hidden, cell) = self.rnn(out, (hidden, cell))

        return hidden, cell
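
Here is a minimal sketch of how sequence lengths derived from a padding mask feed into ```pack_padded_sequence```. The toy tensor, the layer sizes, and the choice of `pad_idx = 1` are assumptions for illustration only. The last few lines check that the final hidden state of a padded sequence matches what we get from running the same sequence without its padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_idx = 1

# Shape [seq_len, batch]: column 0 has 4 real tokens, column 1 has 3
x = torch.tensor([[2, 2],
                  [7, 9],
                  [8, 3],
                  [3, 1],
                  [1, 1]])

# One length per batch column, in the same order as the columns of x
lens = (x != pad_idx).sum(dim=0)
print(lens)

embedding = nn.Embedding(10, 4)
rnn = nn.LSTM(4, 6)

with torch.no_grad():
    packed = nn.utils.rnn.pack_padded_sequence(embedding(x), lens.cpu(), enforce_sorted=False)
    _, (hidden, cell) = rnn(packed)
    print(hidden.shape)  # [num_layers, batch, hidden_dim]

    # Run column 1 alone without its padding and compare final hidden states
    _, (h_short, _) = rnn(embedding(x[:3, 1:2]))
    matches = torch.allclose(hidden[:, 1], h_short[:, 0], atol=1e-5)
    print(matches)
```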

Our decoder is something a little different from our encoder. Instead of getting a batch of sequences, it will get a batch of tokens.

The idea of sequence-to-sequence learning is that we encode the source sentence and use it to produce token after token of translations. We'll write the decoder to be used for this function.

Nothing is too special here. We pass it the hidden and cell states from the encoder, then it produces logits that predict the most likely next token. We output the hidden and cell states again to be used for the next round of decoding.

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_sz, embedding_dim, hidden_dim, num_layers=1, dropout=0.5):
        super(Decoder, self).__init__()
        self.embedding = nn.Embedding(vocab_sz, embedding_dim)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=num_layers)
        self.fc1 = nn.Linear(hidden_dim, vocab_sz)
        self.dropout = nn.Dropout(dropout)
        self.vocab_sz = vocab_sz

    def forward(self, x, hidden, cell):
        out = self.embedding(x.unsqueeze(0))
        out = self.dropout(out)
        out, (hidden, cell) = self.rnn(out, (hidden, cell))
        out = self.fc1(out.squeeze(0))

        return out, hidden, cell

We'll wrap everything together in a Seq2Seq wrapper module.

Again, nothing too flashy. The module wraps together the encoder and decoder and defines the sequence-to-sequence generation/training scheme.

First, we encode the source sentence. Second, we set the ```<sos>``` token as the initial "start generation" token and decode using this token and the initial hidden and cell states. We then update the token to the new "prediction" and loop until the entire length of the target sentence is generated.

We'll introduce a technique used in machine translation called **teacher forcing**. With some probability, the correct target token is fed to the decoder as the next input instead of the model's own prediction. In cases where the model strays from the correct translation, teacher forcing steers it back on track.

In [None]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, initrange=0.08):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.init_weights(initrange)

    def init_weights(self, initrange=0.08):
        for name, param in self.named_parameters():
            nn.init.uniform_(param.data, -initrange, initrange)

    def forward(self, x, y, teacher_forcing=0.5, pad_idx=None):
        src_len, bs = x.shape
        trg_len, bs = y.shape

        # Make container for outputs
        weight = next(self.encoder.parameters())
        outputs = weight.new_zeros(trg_len, bs, self.decoder.vocab_sz)

        # Encode source then prep first input
        hidden, cell = self.encoder(x, pad_idx=pad_idx)
        input_ids = y[0,:]

        # Decode per input token
        for i in range(1, trg_len):
            out, hidden, cell = self.decoder(input_ids, hidden, cell)
            outputs[i] = out

            teacher_force = random.random() < teacher_forcing
            input_ids = y[i] if teacher_force else out.argmax(1)

        return outputs
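
The teacher forcing coin flip can be sketched in isolation, without any model. The function name `choose_inputs` and the hard-coded "predictions" below are hypothetical and only serve to show how the next decoder input is picked at each step.

```python
import random

def choose_inputs(gold, predicted, teacher_forcing, rng):
    # Hypothetical helper: at each step, feed the gold token with
    # probability `teacher_forcing`, otherwise feed the model's guess
    chosen = []
    for gold_tok, pred_tok in zip(gold, predicted):
        use_gold = rng.random() < teacher_forcing
        chosen.append(gold_tok if use_gold else pred_tok)
    return chosen

gold = ['a', 'little', 'girl', 'rides', 'a', 'horse']
# Made-up model guesses, hard-coded for illustration
predicted = ['a', 'small', 'boy', 'rides', 'the', 'horse']

rng = random.Random(0)
always_gold = choose_inputs(gold, predicted, 1.0, rng)
never_gold = choose_inputs(gold, predicted, 0.0, rng)
mixed = choose_inputs(gold, predicted, 0.5, rng)

print(always_gold)
print(never_gold)
print(mixed)
```

A ratio of 1.0 always feeds the correct token, a ratio of 0.0 always feeds the model's own prediction, and anything in between mixes the two.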

Let's test an example.

In [None]:
encoder = Encoder(vocab_sz=len(vocab_set_de), embedding_dim=100, hidden_dim=256)
decoder = Decoder(vocab_sz=len(vocab_set_en), embedding_dim=100, hidden_dim=256)
model = Seq2Seq(encoder, decoder)

criterion = nn.CrossEntropyLoss()

Get some initial data.

In [None]:
x, y = next(iter(train_loader))
x, y = x.rot90(k=3), y.rot90(k=3)

And test.

In [None]:
out = model(x, y, pad_idx=word2idx_de['<pad>'])
print(out.shape)

torch.Size([43, 128, 4556])


To get the loss, we disregard the first position, which holds the ```<sos>``` token and is never predicted by the decoder. As in language modeling, we flatten the outputs into 2 dimensions.

In [None]:
loss = criterion(out[1:].flatten(0, 1), y[1:].flatten(0))

Here's our initial loss.

In [None]:
loss

tensor(8.3815, grad_fn=<NllLossBackward>)
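
As a sanity check on the flattening and on the size of that number, here is a minimal sketch with all-zero logits, which correspond to a uniform prediction over the vocabulary. The small `trg_len` and `bs` values are arbitrary assumptions; only the vocabulary size is taken from the notebook. With a uniform prediction the cross entropy equals the natural log of the vocabulary size, which is roughly where an untrained model should start.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
trg_len, bs, vocab_sz = 5, 3, 4556

# All-zero logits give a uniform distribution after softmax
logits = torch.zeros(trg_len, bs, vocab_sz)
targets = torch.randint(0, vocab_sz, (trg_len, bs))

# Drop position 0, then merge the time and batch dimensions
flat_logits = logits[1:].flatten(0, 1)   # [(trg_len - 1) * bs, vocab_sz]
flat_targets = targets[1:].flatten(0)    # [(trg_len - 1) * bs]
loss = nn.CrossEntropyLoss()(flat_logits, flat_targets)

print(flat_logits.shape, flat_targets.shape)
print(loss.item(), math.log(vocab_sz))
```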

# Training

Time to put everything together.

We'll initialize an encoder and decoder with the same hyperparameters (we need to do this to ensure that the hidden and cell shapes remain the same). Wrap them up in a Seq2Seq wrapper, then pass to the GPU.

Adam is still our optimizer of choice. We'll use cosine annealing as a basic scheduler for this example.

In [None]:
encoder = Encoder(vocab_sz=len(vocab_set_de), embedding_dim=256, hidden_dim=512, num_layers=2, dropout=0.5)
decoder = Decoder(vocab_sz=len(vocab_set_en), embedding_dim=256, hidden_dim=512, num_layers=2, dropout=0.5)
model = Seq2Seq(encoder, decoder).to(device)

optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=word2idx_en['<pad>'])

epochs = 10
iters = epochs * len(train_loader)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=iters, eta_min=0)
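
To get a feel for what the scheduler does, here is a minimal sketch that records the learning rate over a short run. The single dummy parameter and the 10-step horizon are assumptions for illustration; the learning rate is Adam's default.

```python
import torch
import torch.optim as optim

# A single dummy parameter stands in for the model
param = torch.nn.Parameter(torch.zeros(1))
param.grad = torch.zeros_like(param)  # dummy gradient so optimizer.step() has something to use

optimizer = optim.Adam([param], lr=1e-3)
total_steps = 10
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=0)

lrs = []
for _ in range(total_steps):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()
    scheduler.step()
lrs.append(optimizer.param_groups[0]['lr'])

print(['{:.6f}'.format(lr) for lr in lrs])
```

The learning rate starts at its initial value, follows half a cosine wave, and reaches ```eta_min``` after ```T_max``` steps. This is why we step the scheduler once per batch and set ```T_max``` to the total number of iterations.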

Here's the number of parameters in our model.

In [None]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("The model has {:,} trainable parameters".format(count_parameters(model)))

The model has 12,236,236 trainable parameters


We'll train our model for 10 epochs, setting gradient clipping to 1.0.

In [None]:
clip = 1.0

for e in range(1, epochs + 1):
    train_loss = 0
    
    model.train()
    for x, y in tqdm(train_loader):
        x, y = x.rot90(k=3).to(device), y.rot90(k=3).to(device)

        out = model(x, y, pad_idx=word2idx_de['<pad>'])
        loss = criterion(out[1:].flatten(0, 1), y[1:].flatten(0))
        
        optimizer.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        scheduler.step()

        train_loss += loss.item()
    train_loss /= len(train_loader)

    valid_loss = 0
    
    model.eval()
    with torch.no_grad():
        for x, y in tqdm(valid_loader):
            x, y = x.rot90(k=3).to(device), y.rot90(k=3).to(device)

            out = model(x, y, pad_idx=word2idx_de['<pad>'])
            loss = criterion(out[1:].flatten(0, 1), y[1:].flatten(0))

            valid_loss += loss.item()
    valid_loss /= len(valid_loader)

    print("\nEpoch {:3} | Train Loss {:.4f} | Train Ppl {:.4f} | Valid Loss {:.4f} | Valid Ppl {:.4f}".format(e, train_loss, np.exp(train_loss), valid_loss, np.exp(valid_loss)))

100%|██████████| 227/227 [01:16<00:00,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 13.82it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   1 | Train Loss 4.8932 | Train Ppl 133.3773 | Valid Loss 4.3267 | Valid Ppl 75.6950


100%|██████████| 227/227 [01:16<00:00,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 13.66it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   2 | Train Loss 4.2014 | Train Ppl 66.7812 | Valid Loss 3.9796 | Valid Ppl 53.4944


100%|██████████| 227/227 [01:16<00:00,  2.95it/s]
100%|██████████| 8/8 [00:00<00:00, 13.62it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   3 | Train Loss 3.9506 | Train Ppl 51.9646 | Valid Loss 3.8309 | Valid Ppl 46.1054


100%|██████████| 227/227 [01:16<00:00,  2.95it/s]
100%|██████████| 8/8 [00:00<00:00, 13.56it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   4 | Train Loss 3.7697 | Train Ppl 43.3692 | Valid Loss 3.6327 | Valid Ppl 37.8142


100%|██████████| 227/227 [01:16<00:00,  2.95it/s]
100%|██████████| 8/8 [00:00<00:00, 13.64it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   5 | Train Loss 3.6098 | Train Ppl 36.9603 | Valid Loss 3.5103 | Valid Ppl 33.4571


100%|██████████| 227/227 [01:16<00:00,  2.95it/s]
100%|██████████| 8/8 [00:00<00:00, 13.64it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   6 | Train Loss 3.4949 | Train Ppl 32.9486 | Valid Loss 3.4151 | Valid Ppl 30.4204


100%|██████████| 227/227 [01:16<00:00,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 13.62it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   7 | Train Loss 3.4229 | Train Ppl 30.6586 | Valid Loss 3.4263 | Valid Ppl 30.7632


100%|██████████| 227/227 [01:16<00:00,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 13.75it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   8 | Train Loss 3.3689 | Train Ppl 29.0469 | Valid Loss 3.3817 | Valid Ppl 29.4221


100%|██████████| 227/227 [01:16<00:00,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 13.79it/s]
  0%|          | 0/227 [00:00<?, ?it/s]


Epoch   9 | Train Loss 3.3143 | Train Ppl 27.5032 | Valid Loss 3.3889 | Valid Ppl 29.6347


100%|██████████| 227/227 [01:16<00:00,  2.96it/s]
100%|██████████| 8/8 [00:00<00:00, 13.72it/s]


Epoch  10 | Train Loss 3.3131 | Train Ppl 27.4696 | Valid Loss 3.4424 | Valid Ppl 31.2607
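
For reference, here is a minimal sketch of what gradient clipping does to the gradient norm. The tiny linear layer and the inflated loss are assumptions made up to force large gradients; this is not part of the training run above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 1)

# Scale the output to deliberately produce large gradients
loss = (layer(torch.ones(8, 4)) * 100).sum()
loss.backward()

# clip_grad_norm_ returns the total norm measured BEFORE clipping
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)

# Recompute the total norm from the (now rescaled) gradients
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in layer.parameters()))
print(float(norm_before), float(norm_after))
```

All gradients are rescaled together so that their combined norm is at most the clipping value, which keeps a single bad batch from blowing up the update.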





If you trained the model yourself, you can save the weights to reuse them later.

In [None]:
torch.save(model.state_dict(), 'seq2seq.pt')

# Sampling

Next, we want to see our translation model in action.

Let's load the saved weights and set the model to evaluation mode. We'll move it to the CPU later, right before sampling, to prevent GPU overallocation.

In [None]:
encoder = Encoder(vocab_sz=len(vocab_set_de), embedding_dim=256, hidden_dim=512, num_layers=2, dropout=0.5)
decoder = Decoder(vocab_sz=len(vocab_set_en), embedding_dim=256, hidden_dim=512, num_layers=2, dropout=0.5)
model = Seq2Seq(encoder, decoder).to(device)

model.load_state_dict(torch.load('seq2seq.pt', map_location=device))
model.eval();

In [None]:
model.eval()
valid_loss = 0
with torch.no_grad():
    for x, y in tqdm(valid_loader):
        x, y = x.rot90(k=3).to(device), y.rot90(k=3).to(device)

        out = model(x, y, pad_idx=word2idx_de['<pad>'])
        loss = criterion(out[1:].flatten(0, 1), y[1:].flatten(0))

        valid_loss += loss.item()
valid_loss /= len(valid_loader)

print("\nValid Loss {:.4f} | Valid Ppl {:.4f}".format(valid_loss, np.exp(valid_loss)))

100%|██████████| 8/8 [00:00<00:00, 11.12it/s]


Valid Loss 3.3153 | Valid Ppl 27.5306
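
The perplexity we print is just the exponential of the average cross entropy loss. A minimal sketch, using the validation loss reported above (the function name `perplexity` is our own):

```python
import math

def perplexity(mean_cross_entropy):
    # Perplexity is the exponential of the average per-token cross entropy
    return math.exp(mean_cross_entropy)

valid_ppl = perplexity(3.3153)
print("Valid Ppl {:.4f}".format(valid_ppl))
```

One way to read it: the model is about as uncertain as if it were choosing uniformly among that many tokens at each step.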





For this example, we'll use the first example in the first batch.

In [None]:
x, y = next(iter(valid_loader))
x, y = x.rot90(k=3), y.rot90(k=3)
sample = x[:, 0].unsqueeze(1)

This is the source sentence.

In [None]:
print(' '.join([idx2word_de[idx] for idx in list(sample.squeeze(1).numpy())]))

<sos> ein kleines rothaariges mädchen in einem <unk> reitet auf einem spielzeugpferd . <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


Here's the correct translation.

In [None]:
print(' '.join([idx2word_en[idx] for idx in list(y[:, 0].numpy())]))

<sos> a little redheaded girl wears a spider - man suit while riding a play horse . <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


We'll use basic multinomial sampling for this.

Our sampling scheme is similar to seq2seq learning, minus the teacher forcing. We first encode the source sequence, then sample the next token from the softmax of the temperature-scaled logits. We feed that token back in over and over until we hit the max number of tokens or the end of sequence token.

In [None]:
predictions = []
max_words = 20
temperature = 0.8
torch.manual_seed(1111)

model = model.cpu()

with torch.no_grad():
    hidden, cell = model.encoder(sample)
    # Start from <sos>; special tokens share the same indices in both vocabularies
    token = sample[0]

    for _ in range(max_words):
        out, hidden, cell = model.decoder(token, hidden, cell)
        # Sample the next token from the temperature-scaled distribution
        weights = torch.softmax(out / temperature, dim=-1)
        token = torch.multinomial(weights, 1).squeeze(0)

        predictions.append(token.item())

        if token.item() == word2idx_en['<eos>']:
            break

print(' '.join([idx2word_en[ix] for ix in predictions]))

a smiling blond girl wearing a yellow sweater is sitting on a green bench slide . <eos>
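
To see how temperature changes the sampling distribution, here is a minimal sketch on a made-up vector of four logits. It also contrasts greedy decoding (taking the argmax) with multinomial sampling.

```python
import torch

# Made-up logits over a 4-token vocabulary
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

# Lower temperatures sharpen the distribution, higher ones flatten it
probs_by_temp = {}
for temperature in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    probs_by_temp[temperature] = probs
    print(temperature, [round(p, 3) for p in probs.tolist()])

# Greedy decoding always picks the same token
greedy = logits.argmax().item()

# Multinomial sampling picks tokens in proportion to their probability
torch.manual_seed(0)
probs = torch.softmax(logits / 0.8, dim=-1)
samples = torch.multinomial(probs, num_samples=1000, replacement=True)
counts = torch.bincount(samples, minlength=len(logits))
print(greedy, counts.tolist())
```

A temperature below 1 (like the 0.8 used above) makes the most likely tokens even more likely while still leaving room for variety.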


Our model seems to know that there's a little girl and she is wearing a certain type of clothing. The other details weren't translated, but hey, that's pretty good for our first translation model!

The basic seq2seq scheme has problems:
1. It's highly unlikely that the decoder will retain the information learned by the encoder about the source sentence, especially for long sequences.
2. The model has no way to learn alignment.
3. Once the model has overwritten the hidden and cell states, there is no way for it to refer back to the source information.

In the next notebook, we'll learn how to leverage the attention mechanism to solve these problems.