In [1]:
import numpy as np
import operator
from torch import optim
import torch.nn.functional as F
import torch.nn as nn
import torch
import sys

sys.path.append(".")
import utils

This is an introduction to basic sequence-to-sequence learning with recurrent networks (the models below use GRU layers).
Given a string of characters representing a math problem, "3141+42", we would like to generate the string of characters representing the correct solution: "3183". Our network will learn how to do basic mathematical operations.
The important part is that we do not first use our human intelligence to split the string into integers and a mathematical operator. We want the computer to figure all that out by itself.
Each math problem is an input sequence: a list of digits in {0,...,9} and math operation symbols.
The result of the operation ("$3141+42$" $\rightarrow$ "$3183$") is the sequence to decode.

**math_operators** is the set of $5$ operations we are going to use to build our input sequences.<br/>
The math_expressions_generation function uses them to generate a large set of examples.

In [2]:

def math_expressions_generation(n_samples=1000, n_digits=3, invert=True):
    X, Y = [], []
    math_operators = {
        "+": operator.add,
        "-": operator.sub,
        "*": operator.mul,
        "/": operator.truediv,
        "%": operator.mod,
    }
    for i in range(n_samples):
        a, b = np.random.randint(1, 10 ** n_digits, size=2)
        op = np.random.choice(list(math_operators.keys()))
        res = math_operators[op](a, b)
        x = "".join([str(elem) for elem in (a, op, b)])
        if invert is True:
            x = x[::-1]
        y = "{:.5f}".format(res) if isinstance(res, float) else str(res)
        X.append(x)
        Y.append(y)
    return X, Y


In [3]:
quick_for_debugg = False
n_samples = 100 if quick_for_debugg else int(1e5)

X, y = math_expressions_generation(n_samples=n_samples, n_digits=3, invert=True)
for X_i, y_i in list(zip(X, y))[:20]:
    print(X_i[::-1], "=", y_i)

998%529 = 469
168+795 = 963
733*375 = 274875
354-460 = -106
435%253 = 182
378+236 = 614
188-295 = -107
818%920 = 818
140+333 = 473
526%202 = 122
394-83 = 311
623+471 = 1094
155/639 = 0.24257
345%809 = 345
465-865 = -400
136/5 = 27.20000
324+884 = 1208
966/34 = 28.41176
335*802 = 268670
334%838 = 334
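
As a small worked example of how one sample is built, the sketch below reproduces the string handling of `math_expressions_generation` for fixed operands instead of random ones. The helper name `make_sample` is hypothetical and is not used elsewhere in the notebook.

```python
import operator


def make_sample(a, op_symbol, b, invert=True):
    """Build one (input, target) string pair like the generator above does."""
    ops = {
        "+": operator.add,
        "-": operator.sub,
        "*": operator.mul,
        "/": operator.truediv,
        "%": operator.mod,
    }
    res = ops[op_symbol](a, b)
    x = "{}{}{}".format(a, op_symbol, b)
    if invert:
        # the encoder input is the expression read right to left
        x = x[::-1]
    # division yields a float, written with 5 decimals; other results stay integers
    y = "{:.5f}".format(res) if isinstance(res, float) else str(res)
    return x, y


print(make_sample(94, "+", 8))   # ('8+49', '102')
print(make_sample(136, "/", 5))  # ('5/631', '27.20000')
```

The printed samples in the cell above are shown with `X_i[::-1]`, which undoes the inversion for readability.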


# I - Encoder and decoder models

- encoder and decoder are both GRU models
- encoder and decoder both take an input sequence and output $1$ hidden vector for each step in input sequence
- the decoder also outputs $1$ softmax per step in input sequence, that corresponds to the next predicted token
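
For reference, one GRU step can be written as below. This follows the formulation given in the PyTorch `nn.GRU` documentation; the symbols ($x_t$ for the input at step $t$, $h_t$ for the hidden vector, $r_t$, $z_t$, $n_t$ for the reset, update and candidate gates, $\sigma$ for the sigmoid, $\odot$ for the element-wise product) are introduced here for illustration and are not used elsewhere in the notebook.

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$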

In the next cells the example is:
- sequence to encode: 94+8
- sequence to decode: $102\text{<EOS>}$

**NB: In this lab all tensors have a $\text{batch_size}$ axis in addition to the usual $\text{nb_timesteps, vector_dim}$ axes.**

**The batch axis is there because the PyTorch GRU (like most other PyTorch layers) can process batched tensors, that is, tensors that contain several sequences.**

**In the returned tensor, the results for each sequence are laid out along the batch axis.**

**encoder and decoder inputs**
- for the encoder, the input sequence is the operation: $94+8$

<img src="../images/encoder_input.png" style="width: 600px;" />

- for the decoder, if using teacher forcing, the input sequence is the sequence to decode shifted by one step, so that it starts with the $\text{<GO>}$ token: $\text{<GO>}102$

<img src="../images/decoder_input_all.png" style="width: 600px;" />

- for the decoder, if **not** using teacher forcing, the input sequence is $1$ timestep long and is either the $\text{<GO>}$ token or the previous predicted token:

<img src="../images/decoder_input_one.png" style="width: 600px;" />

For the decoder, these $3$ scenarios are handled the same way: the input sequence has shape $(\text{nb_timesteps, batch_size, input_dim})$, and the decoder goes through all timesteps of each sequence, producing $1$ hidden vector and $1$ prediction per timestep.
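
To make the relation between the decoder input and its target concrete, here is a minimal plain-Python sketch for a single sequence (the batch axis is left out, so each tensor becomes a `(nb_timesteps, vocabulary)` nested list). It mirrors the conventions used later in `_construct_data_set`: "=" as GO, "\n" as EOS, EOS stored at index 0, and all-zero rows as padding. The helper names `one_hot_sequence` and `argmax` are illustrative assumptions.

```python
GO, EOS = "=", "\n"


def one_hot_sequence(text, char_index, length):
    """Return a (length, vocab) nested list; rows past len(text) stay all-zero."""
    vocab = len(char_index)
    rows = [[0.0] * vocab for _ in range(length)]
    for t, char in enumerate(text):
        rows[t][char_index[char]] = 1.0
    return rows


def argmax(row):
    """Index of the first maximal value in a row."""
    return max(range(len(row)), key=row.__getitem__)


answer = "102"
full = GO + answer + EOS                    # "=102\n"
chars = [EOS] + sorted(set(full) - {EOS})   # EOS placed at index 0
char_index = {c: i for i, c in enumerate(chars)}

decoder_input = one_hot_sequence(full, char_index, length=6)
target = one_hot_sequence(full[1:], char_index, length=6)  # shifted by one step

# Decoding the target row by row: padded (all-zero) rows fall back to index 0, i.e. EOS
decoded_target = "".join(chars[argmax(row)] for row in target)
print(repr(decoded_target))  # '102\n\n\n'
```

At timestep `t` the decoder reads `decoder_input[t]` and is expected to predict `target[t]`, which is the next character of the answer.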

**no attention vs attention**

the attention mechanism is handled (and implemented) at the decoder level

**no attention**

<img src="../images/decoder_no_attention_all.png" style="width: 900px;" />

At each timestep, the hidden vector is used to predict the next token


**attention**

<img src="../images/decoder_attention_all.png" style="width: 900px;" />

The attention used here is applied to the decoder hidden vectors after the GRU has produced them.
- For each timestep of the decoder input, the similarity between the decoder hidden vector and every encoder hidden vector is computed. This determines which tokens of the encoder input to focus on. Here the similarity is simply a dot product $hdec^T \cdot henc$ between the vectors.
- For each timestep of the decoder input, these similarity scores are passed through a softmax so that the resulting attention weights sum to $1$.
- For each timestep of the decoder input, a weighted sum of the encoder hidden vectors is computed with these weights. This is the context vector. How heavily it is weighted towards particular encoder hidden vectors reflects which tokens the model focuses on.
- The context vector is used to predict the next token: a linear layer maps it to the vocabulary dimension and a softmax is applied.
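
The first three steps above can be sketched for a single decoder timestep in plain Python. This is only an illustration with made-up two-dimensional hidden vectors, not the batched implementation in the next cell; the function names `softmax`, `dot` and `attend` are assumptions chosen for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(hdec, henc_ts):
    """Dot-product attention for one decoder hidden vector."""
    # 1) similarity between the decoder vector and every encoder vector
    scores = [dot(hdec, henc) for henc in henc_ts]
    # 2) softmax so that the weights sum to 1
    weights = softmax(scores)
    # 3) context vector = weighted sum of the encoder vectors
    dim = len(hdec)
    context = [
        sum(w * henc[d] for w, henc in zip(weights, henc_ts)) for d in range(dim)
    ]
    return weights, context


henc_ts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder hidden vectors
hdec = [2.0, 0.0]                               # toy decoder hidden vector
weights, context = attend(hdec, henc_ts)
print(weights, context)
```

With these toy values the first and third encoder vectors have the same dot product with `hdec`, so they receive equal weights, and the second one receives a smaller weight.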

In [4]:
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, device):
        super(EncoderRNN, self).__init__()
        self.device = device
        self.hidden_size = hidden_size
        self.gru = nn.GRU(input_size, hidden_size).to(self.device)

    def forward(self, encoder_input, henc_init=None):
        if henc_init is None:
            henc_init = torch.zeros(
                1, encoder_input.size()[1], self.hidden_size, device=self.device
            ).to(self.device)
        henc_ts, henc_final = self.gru(encoder_input, henc_init)
        return henc_ts, henc_final


In [5]:
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, device, attention=False):
        super(DecoderRNN, self).__init__()
        self.device = device
        self.hidden_size = hidden_size
        self.gru = nn.GRU(output_size, hidden_size).to(self.device)
        self.linear = nn.Linear(hidden_size, output_size).to(self.device)
        self.attention = attention

    def forward(self, decoder_input, hdec_init, henc_ts=None):

        hdec_ts, hdec_final = self.gru(decoder_input, hdec_init)
        if self.attention:
            assert henc_ts is not None
            # (ts_dec, batch, dim) to (batch, ts_dec, dim)
            hdec_ts = hdec_ts.permute(1, 0, 2)
            # (ts_enc, batch, dim) to (batch, dim, ts_enc)
            henc_ts = henc_ts.permute(1, 2, 0)
            # (batch, ts_dec, ts_enc)
            attn_weights_dec_to_enc = torch.bmm(hdec_ts, henc_ts)
            # (batch, ts_dec, ts_enc) to (ts_dec, batch, ts_enc)
            attn_weights_dec_to_enc = attn_weights_dec_to_enc.permute(1, 0, 2)
            # (batch, dim, ts_enc) to (batch, ts_enc, dim)
            henc_ts = henc_ts.permute(0, 2, 1)
            attn_weights_dec_to_enc = F.softmax(attn_weights_dec_to_enc, dim=2)
            # (batch, ts_enc, dim) to (ts_dec, batch, ts_enc, dim)
            henc_ts = henc_ts.unsqueeze(0).repeat(
                attn_weights_dec_to_enc.size()[0], 1, 1, 1
            )
            # (ts_dec, batch, ts_enc) to (ts_dec, batch, ts_enc, 1)
            attn_weights_dec_to_enc = attn_weights_dec_to_enc.unsqueeze(3)
            # (ts_dec, batch, ts_enc, 1) x (ts_dec, batch, ts_enc, dim) --> (ts_dec, batch, ts_enc, dim)
            context_vectors = attn_weights_dec_to_enc * henc_ts
            # (ts_dec, batch, ts_enc, dim) to (ts_dec, batch, dim)
            context_vectors = context_vectors.sum(2)
            output = F.log_softmax(self.linear(context_vectors), dim=2)
        else:
            output = F.log_softmax(self.linear(hdec_ts), dim=2)
        return output, hdec_final

# II - Sequence to sequence model

**GO** is the character ("=") that marks the beginning of decoding for the decoder GRU<br/>
**EOS** is the character ("\n") that marks the end of sequence to decode for the decoder GRU

**global Seq2seq architecture (teacher forcing scenario)**

<img src="../images/seq2seq_teacher.png" style="width: 1000px;" />

The teacher forcing mechanism is handled (and implemented) at the level of the seq2seq forward pass.

Whether teacher forcing is used depends on the kind of input passed to the decoder.

**teacher forcing**

<img src="../images/teacher_forcing.png" style="width: 600px;" />

- the decoder input is the sequence of expected decoded tokens at all timesteps.
- the decoder input is passed in one go to the decoder. The decoder goes through all timesteps and decodes the whole sequence in one go.
- the decoder input is of shape $(\text{nb_timesteps, batch_size, input_dim})$.

**no teacher forcing**

<img src="../images/no_teacher_forcing.png" style="width: 1000px;" />

- the decoder input is $1$ timestep long and either the $\text{GO}$ token or the previous decoded token
- the decoder inputs are passed to the decoder iteratively, one stage at a time. At each stage, the decoder is given the previously returned hidden vector as its state and takes the previously decoded token as input. It produces a new hidden vector and a new decoded token, which are passed on to the next stage.
- the decoder input for each stage is of shape $(\text{1, batch_size, input_dim})$.
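
The difference between the two modes can be illustrated with a toy decoder. In the sketch below, the lookup table `NEXT_TOKEN` is a hypothetical stand-in for a trained decoder step: it ignores the hidden state entirely and deliberately contains one mistake (after "0" it predicts "3" instead of "2"). It is not the model defined in the next cell.

```python
GO, EOS = "=", "\n"

# Hypothetical stand-in for one decoder step: previous token -> predicted token.
# The true answer is "102"; the entry for "0" is wrong on purpose.
NEXT_TOKEN = {"=": "1", "1": "0", "0": "3", "2": EOS, "3": "3"}


def toy_step(prev_token):
    return NEXT_TOKEN[prev_token]


def decode_teacher_forcing(target):
    """Feed the ground-truth tokens, shifted by one step, all at once."""
    inputs = GO + target[:-1]
    return "".join(toy_step(token) for token in inputs)


def decode_free_running(max_len):
    """Feed each prediction back as the next input, one stage at a time."""
    token, out = GO, []
    for _ in range(max_len):
        token = toy_step(token)
        out.append(token)
        if token == EOS:
            break
    return "".join(out)


target = "102" + EOS
forced = decode_teacher_forcing(target)
free = decode_free_running(max_len=6)
print(repr(forced))  # '103\n'
print(repr(free))    # '103333'
```

With teacher forcing the wrong "3" stays an isolated error because the next input is the true "2". Without it, the wrong token is fed back and the toy decoder never reaches EOS within the allowed length.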

In [6]:
class Seq2seq(nn.Module):
    def __init__(
        self, X, y, hidden_size=256, learning_rate=0.01, attention=False
    ):
        super(Seq2seq, self).__init__()
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        self.X = X
        self.y = y
        self.GO = "="
        self.EOS = "\n"
        self.dataset_size = None
        self.encoder_char_index = None
        self.encoder_index_char = None
        self.decoder_char_index = None
        self.decoder_index_char = None
        self.encoder_vocabulary_size = None
        self.decoder_vocabulary_size = None
        self.max_encoder_sequence_length = None
        self.max_decoder_sequence_length = None
        self.encoder_input_tr = None
        self.encoder_input_val = None
        self.decoder_input_tr = None
        self.decoder_input_val = None
        self.target_tr = None
        self.target_val = None
        self._set_data_properties_attributes()
        self._construct_data_set()
        self.encoder = EncoderRNN(
            input_size=self.encoder_vocabulary_size,
            hidden_size=hidden_size,
            device=self.device,
        )
        self.decoder = DecoderRNN(
            hidden_size=hidden_size,
            output_size=self.decoder_vocabulary_size,
            attention=attention,
            device=self.device,
        )
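        # nn.Module already exposes parameters(); since encoder and decoder are
        # registered submodules, it yields the weights of both without shadowing
        # the built-in method with a list attribute.
        self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)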
        self.criterion = nn.NLLLoss(reduction="mean")
        # training attributes
        self.total_loss = None
        self.total_loss_nb_samples = None

    def _set_data_properties_attributes(self):
        self.y = list(map(lambda token: self.GO + token + self.EOS, self.y))
        self.dataset_size = len(self.X)
        encoder_characters = sorted(list(set("".join(self.X))))
        decoder_characters = sorted(list(set("".join(self.y))))
        decoder_characters.remove(self.EOS)
        # set EOS at 0 index so argmax on zero vector falls at EOS
        decoder_characters = [self.EOS] + decoder_characters
        self.encoder_char_index = dict((c, i) for i, c in enumerate(encoder_characters))
        self.encoder_index_char = dict((i, c) for i, c in enumerate(encoder_characters))
        self.decoder_char_index = dict((c, i) for i, c in enumerate(decoder_characters))
        self.decoder_index_char = dict((i, c) for i, c in enumerate(decoder_characters))
        self.encoder_vocabulary_size = len(self.encoder_char_index)
        self.decoder_vocabulary_size = len(self.decoder_char_index)
        self.max_encoder_sequence_length = max([len(sequence) for sequence in self.X])
        self.max_decoder_sequence_length = max([len(sequence) for sequence in self.y])
        print("Number of samples:", self.dataset_size)
        print("Number of unique encoder tokens:", self.encoder_vocabulary_size)
        print("Number of unique decoder tokens:", self.decoder_vocabulary_size)
        print("Max sequence length for encoding:", self.max_encoder_sequence_length)
        print("Max sequence length for decoding:", self.max_decoder_sequence_length)

    def _construct_data_set(self):
        encoder_input = torch.zeros(
            (
                self.max_encoder_sequence_length,
                self.dataset_size,
                self.encoder_vocabulary_size,
            ),
            dtype=torch.float32,
        )
        decoder_input = torch.zeros(
            (
                self.max_decoder_sequence_length,
                self.dataset_size,
                self.decoder_vocabulary_size,
            ),
            dtype=torch.float32,
        )
        target = torch.zeros(
            (
                self.max_decoder_sequence_length,
                self.dataset_size,
                self.decoder_vocabulary_size,
            ),
            dtype=torch.float32,
        )

        for i, (X_i, y_i) in enumerate(zip(self.X, self.y)):
            for t, char in enumerate(X_i):
                encoder_input[t, i, self.encoder_char_index[char]] = 1.0
            for t, char in enumerate(y_i):
                decoder_input[t, i, self.decoder_char_index[char]] = 1.0
                if t > 0:
                    target[t - 1, i, self.decoder_char_index[char]] = 1.0

        p_val = 0.25
        size_val = int(p_val * self.dataset_size)
        idxs = np.arange(self.dataset_size)
        np.random.shuffle(idxs)
        idxs_tr = idxs[:-size_val]
        idxs_val = idxs[-size_val:]
        (
            self.encoder_input_tr,
            self.encoder_input_val,
            self.decoder_input_tr,
            self.decoder_input_val,
            self.target_tr,
            self.target_val,
        ) = (
            encoder_input[:, idxs_tr, :],
            encoder_input[:, idxs_val, :],
            decoder_input[:, idxs_tr, :],
            decoder_input[:, idxs_val, :],
            target[:, idxs_tr, :],
            target[:, idxs_val, :],
        )
        self.encoder_input_tr = self.encoder_input_tr.to(self.device)
        self.encoder_input_val = self.encoder_input_val.to(self.device)
        self.decoder_input_tr = self.decoder_input_tr.to(self.device)
        self.decoder_input_val = self.decoder_input_val.to(self.device)
        self.target_tr = self.target_tr.to(self.device)
        self.target_val = self.target_val.to(self.device)

    def forward(
        self, encoder_input, decoder_input=None, teacher_enforce=True, inference=False
    ):

        batch_size = encoder_input.size()[1]
        if inference:
            assert (
                batch_size == 1
            ), "during inference batch size must be 1: 1 sequence processed"
            if teacher_enforce:
                print("Warning teacher_enforce will be set to False for inference")
                teacher_enforce = False

        henc_ts, henc_final = self.encoder(encoder_input)

        if teacher_enforce:
            assert decoder_input is not None
            pred_softmax_all_ts, hdec_final = self.decoder(
                decoder_input,
                hdec_init=henc_final,
                henc_ts=henc_ts if self.decoder.attention else None,
            )

        elif not teacher_enforce:
            pred_softmax_all_ts = []
            decoder_input = torch.zeros(1, batch_size, self.decoder_vocabulary_size)
            decoder_input[0, :, self.decoder_char_index[self.GO]] = 1
            decoder_input = decoder_input.to(self.device)
            hdec_init = henc_final
            # iterate over all decoder stages
            for _ in range(self.max_decoder_sequence_length):
                pred_softmax, hdec_final = self.decoder(
                    decoder_input,
                    hdec_init=hdec_init,
                    henc_ts=henc_ts if self.decoder.attention else None,
                )
                pred_softmax_all_ts.append(pred_softmax)
                # convert softmax predictions to idx
                preds_idx = pred_softmax.argmax(dim=2)
                # convert idx predictions to one-hot encoding
                decoder_input = torch.zeros(1, batch_size, self.decoder_vocabulary_size)
                decoder_input = decoder_input.to(self.device)
                decoder_input[0, np.arange(batch_size), preds_idx] = 1

                hdec_init = hdec_final
                if inference:
                    pred = preds_idx.squeeze().item()
                    if pred == self.decoder_char_index[self.EOS]:
                        break
            pred_softmax_all_ts = torch.cat(pred_softmax_all_ts)

        return pred_softmax_all_ts

    def _train_on_batch(
        self, encoder_input, target, teacher_forcing, decoder_input=None
    ):
        self.optimizer.zero_grad()
        prediction = self.forward(
            encoder_input, decoder_input=decoder_input, teacher_enforce=teacher_forcing
        )
        target_idx = target.argmax(2)
        loss_on_batch = self.criterion(
            prediction.reshape(-1, prediction.size()[2]), target_idx.reshape(-1)
        )
        loss_on_batch.backward()
        self.optimizer.step()

        return loss_on_batch

    def train(self, nb_epoch=10, batch_size=64, teacher_enforce=True):
        arr = np.arange(self.encoder_input_tr.size()[1])
        np.random.shuffle(arr)
        nb_batch = int(self.encoder_input_tr.size()[1] / batch_size)
        verbose_every = 5 if nb_batch >= 5 else 1

        for epoch in range(nb_epoch):
            self._reset_monitor_train_epoch()
            if epoch > 0:
                print()
            for batch_idx in range(nb_batch):
                idxs = arr[batch_idx * batch_size : (batch_idx + 1) * batch_size]
                encoder_input_batch_tr = self.encoder_input_tr[:, idxs, :]
                target_batch_tr = self.target_tr[:, idxs, :]
                decoder_input_batch_tr = self.decoder_input_tr[:, idxs, :]

                batch_loss_tr = self._train_on_batch(
                    encoder_input_batch_tr,
                    target_batch_tr,
                    teacher_forcing=teacher_enforce,
                    decoder_input=decoder_input_batch_tr,
                )
                self._monitor_train_epoch(
                    batch_loss=batch_loss_tr,
                    batch_size=encoder_input_batch_tr.size()[1],
                )

                if (batch_idx + 1) % verbose_every == 0:
                    self._display_training(
                        epoch, nb_epoch, batch_idx, nb_batch, epoch_ended=False
                    )

            self._monitor_validation(teacher_enforce=teacher_enforce)
            self._display_training(
                epoch, nb_epoch, batch_idx, nb_batch, epoch_ended=True
            )

    def _monitor_train_epoch(self, batch_loss, batch_size):
        # detach so the running total does not keep autograd history alive
        self.total_loss += batch_loss.detach() * batch_size
        self.total_loss_nb_samples += batch_size

    def _reset_monitor_train_epoch(self):
        self.total_loss = 0
        self.total_loss_nb_samples = 0

    def _monitor_validation(self, teacher_enforce):
        # no gradients are needed for validation; this avoids building a graph
        # over the whole validation set
        with torch.no_grad():
            prediction_val = self(
                self.encoder_input_val,
                decoder_input=self.decoder_input_val,
                teacher_enforce=teacher_enforce,
            )
            target_val_idx = self.target_val.argmax(2)
            self.last_loss_val = self.criterion(
                prediction_val.reshape(-1, prediction_val.size()[2]),
                target_val_idx.reshape(-1),
            )

    def _display_training(
        self, epoch, nb_epoch, idx_batch, nb_batch, epoch_ended=False
    ):
        msg = "Epoch {}/{} {} {}".format(
            epoch + 1,
            nb_epoch,
            utils.arrow(idx_batch + 1, nb_batch),
            " mean loss: %.5f" % (self.total_loss.item() / self.total_loss_nb_samples),
        )
        if epoch_ended:
            msg += " val loss: %.5f" % self.last_loss_val
        print(msg, end="\r")

    def _tensor_to_words(self, output, decoded=True):
        dict_index_char = (
            self.decoder_index_char if decoded else self.encoder_index_char
        )
        pred_idx = output.argmax(dim=2)
        decoded_words = []
        for seq in range(pred_idx.size()[1]):
            idxs_chars = pred_idx[:, seq]
            decoded_word = "".join(dict_index_char[idx.item()] for idx in idxs_chars)
            if not decoded:
                # correct errors due to zero vectors at the end
                accepted_end_chars = set(list("0123456789"))
                for i in range(len(decoded_word) - 1, -1, -1):
                    if decoded_word[i] in accepted_end_chars:
                        decoded_word = decoded_word[: i + 1]
                        break
            decoded_words.append(decoded_word)
        return decoded_words
    
    def evaluate(self, nb=30):
        nb = min(nb, self.encoder_input_val.size()[1])
        for i in range(nb):
            output = self(
                self.encoder_input_val[:, i : i + 1, :],
                inference=True,
                teacher_enforce=False
            )
            decoded_word = self._tensor_to_words(output, decoded=True)[0]
            operation = self._tensor_to_words(
                self.encoder_input_val[:, i : i + 1, :], decoded=False
            )[0][::-1]
            expected_decoded_word = self._tensor_to_words(
                self.target_val[:, i : i + 1, :], decoded=True
            )[0]
            decoded_word = decoded_word.replace("\n", "")
            operation = operation.replace("\n", "")
            expected_decoded_word = expected_decoded_word.replace("\n", "")
            print(
                "Input sentence: {} Decoded sentence: {} Expected decoded sentence: {}".format(
                    operation, decoded_word, expected_decoded_word
                )
            )
            print()
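
The loss used in `_train_on_batch` combines the decoder's log-softmax outputs with `nn.NLLLoss(reduction="mean")` after flattening the timestep and batch axes. The plain-Python sketch below shows that computation by hand on two made-up positions; `log_softmax` and `nll_mean` are illustrative helpers, not the PyTorch functions.

```python
import math


def log_softmax(logits):
    """Log-probabilities for one position, computed in a stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def nll_mean(log_prob_rows, target_indices):
    """Mean negative log-likelihood over flattened (timestep, sequence) positions."""
    picked = [row[idx] for row, idx in zip(log_prob_rows, target_indices)]
    return -sum(picked) / len(picked)


# Two flattened positions over a 3-token vocabulary (made-up scores)
logits_rows = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
targets = [0, 2]  # index of the expected token at each position
loss = nll_mean([log_softmax(row) for row in logits_rows], targets)
print(loss)
```

Note that in the class above padded target positions are all-zero vectors, so `target.argmax(2)` maps them to index 0, the EOS token, and they are included in the mean.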


### no attention - teacher forcing

In [7]:
seq2seq = Seq2seq(X, y, hidden_size=128, attention=False)

Number of samples: 100000
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 11


In [8]:
seq2seq.train(nb_epoch=3, batch_size=64, teacher_enforce=True)



In [9]:
seq2seq.evaluate()

Input sentence: 431%156 Decoded sentence: 117 Expected decoded sentence: 119

Input sentence: 95-444 Decoded sentence: -371 Expected decoded sentence: -349

Input sentence: 526/266 Decoded sentence: 2.05000 Expected decoded sentence: 1.97744

Input sentence: 408/219 Decoded sentence: 1.98339 Expected decoded sentence: 1.86301

Input sentence: 95-763 Decoded sentence: -662 Expected decoded sentence: -668

Input sentence: 155/132 Decoded sentence: 1.15000 Expected decoded sentence: 1.17424

Input sentence: 507%31 Decoded sentence: 14 Expected decoded sentence: 11

Input sentence: 762/720 Decoded sentence: 1.10000 Expected decoded sentence: 1.05833

Input sentence: 10-507 Decoded sentence: -527 Expected decoded sentence: -497

Input sentence: 836/826 Decoded sentence: 1.02527 Expected decoded sentence: 1.01211

Input sentence: 951*199 Decoded sentence: 183919 Expected decoded sentence: 189249

Input sentence: 319*36 Decoded sentence: 10396 Expected decoded sentence: 11484

Input sentence:

### no attention - no teacher forcing 

In [10]:
seq2seq = Seq2seq(X, y, hidden_size=128, attention=False)

Number of samples: 100000
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 11


In [11]:
seq2seq.train(nb_epoch=3, batch_size=64, teacher_enforce=False)



In [12]:
seq2seq.evaluate()

Input sentence: 555%26 Decoded sentence: 31 Expected decoded sentence: 9

Input sentence: 425*242 Decoded sentence: 101520 Expected decoded sentence: 102850

Input sentence: 457/901 Decoded sentence: 0.43334 Expected decoded sentence: 0.50721

Input sentence: 969%852 Decoded sentence: 13 Expected decoded sentence: 117

Input sentence: 747*597 Decoded sentence: 433371 Expected decoded sentence: 445959

Input sentence: 538+770 Decoded sentence: 1222 Expected decoded sentence: 1308

Input sentence: 867-879 Decoded sentence: -72 Expected decoded sentence: -12

Input sentence: 507+635 Decoded sentence: 1132 Expected decoded sentence: 1142

Input sentence: 553/791 Decoded sentence: 0.73334 Expected decoded sentence: 0.69912

Input sentence: 767+859 Decoded sentence: 1632 Expected decoded sentence: 1626

Input sentence: 508+276 Decoded sentence: 832 Expected decoded sentence: 784

Input sentence: 720-749 Decoded sentence: -11 Expected decoded sentence: -29

Input sentence: 2/953 Decoded sente

### attention - teacher forcing 

In [13]:
seq2seq_attn = Seq2seq(X, y, hidden_size=128, attention=True)

Number of samples: 100000
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 11


In [14]:
seq2seq_attn.train(nb_epoch=3, batch_size=64, teacher_enforce=True)



In [15]:
seq2seq_attn.evaluate()

Input sentence: 960+704 Decoded sentence: 1446 Expected decoded sentence: 1664

Input sentence: 829*455 Decoded sentence: 344445 Expected decoded sentence: 377195

Input sentence: 744+178 Decoded sentence: 948 Expected decoded sentence: 922

Input sentence: 543*980 Decoded sentence: 555500 Expected decoded sentence: 532140

Input sentence: 381-618 Decoded sentence: -243 Expected decoded sentence: -237

Input sentence: 175+372 Decoded sentence: 333 Expected decoded sentence: 547

Input sentence: 1/949 Decoded sentence: 0.04444 Expected decoded sentence: 0.00105

Input sentence: 9*39 Decoded sentence: 2449 Expected decoded sentence: 351

Input sentence: 556+629 Decoded sentence: 1147 Expected decoded sentence: 1185

Input sentence: 246-167 Decoded sentence: 103 Expected decoded sentence: 79

Input sentence: 970*878 Decoded sentence: 750000 Expected decoded sentence: 851660

Input sentence: 680*248 Decoded sentence: 144400 Expected decoded sentence: 168640

Input sentence: 399*703 Decoded

### attention - no teacher forcing 

In [16]:
seq2seq_attn = Seq2seq(X, y, hidden_size=128, attention=True)

Number of samples: 100000
Number of unique encoder tokens: 15
Number of unique decoder tokens: 14
Max sequence length for encoding: 7
Max sequence length for decoding: 11


In [17]:
seq2seq_attn.train(nb_epoch=3, batch_size=64, teacher_enforce=False)



In [18]:
seq2seq_attn.evaluate()

Input sentence: 976-647 Decoded sentence: 169 Expected decoded sentence: 329

Input sentence: 340/474 Decoded sentence: 0.61666 Expected decoded sentence: 0.71730

Input sentence: 576%420 Decoded sentence: 10 Expected decoded sentence: 156

Input sentence: 996+93 Decoded sentence: 1055 Expected decoded sentence: 1089

Input sentence: 836-439 Decoded sentence: 299 Expected decoded sentence: 397

Input sentence: 478-882 Decoded sentence: -316 Expected decoded sentence: -404

Input sentence: 832-635 Decoded sentence: 155 Expected decoded sentence: 197

Input sentence: 533%312 Decoded sentence: 11 Expected decoded sentence: 221

Input sentence: 415%987 Decoded sentence: 305 Expected decoded sentence: 415

Input sentence: 758/46 Decoded sentence: 10.00000 Expected decoded sentence: 16.47826

Input sentence: 297%183 Decoded sentence: 15 Expected decoded sentence: 114

Input sentence: 247/434 Decoded sentence: 0.55555 Expected decoded sentence: 0.56912

Input sentence: 455/215 Decoded sentenc

### Questions:
- 1) Explain the interest in using teacher forcing during training. What is specific about this process?
<span style="color:green">
Teacher forcing means that, at each timestep, the decoder receives the previous correct (ground-truth) token as input. Without it, the decoder would receive its own previous prediction instead. This helps stabilize training: an early mistake is not fed back into later steps, so updates target the weights that actually produced a wrong answer rather than weights that would have answered correctly had they been given the correct input. It also lets the decoder process the whole target sequence in one pass, as in the forward pass above.
</span>
- 2) Describe step by step how the encoder-decoder couple works in this case (~ 5-10 lines)
<span style="color:green">
There are $2$ symmetric networks, $2$ GRUs, that have similar purposes. The first one, the encoder, processes the input vectors $1$ at a time and outputs a vector $h_{t}^{enc}$ at each step. The position of $h_{t}^{enc}$ in vector space tells the decoder which token to predict first and contains information about all the vectors processed so far. Thus at time $t$ there are $2$ choices: either provide $h_{t}^{enc}$ to the decoder so it knows which $1^{st}$ token to decode, or process the next input vector and produce $h_{t+1}^{enc}$.
The final $h_{T}^{enc}$ is provided to the decoder along with the GO token. Together they produce a vector $h_{0}^{dec}$, which a fully connected layer takes as input to predict the first decoded token $\hat{y}^{0}$. The process is then iterated, this time providing $h_{t-1}^{dec}$ and the previous token as input (the true token $y^{t-1}$ under teacher forcing), until the decoder predicts the $\text{<EOS>}$ token.
</span>

### Questions:
- 1) Describe how the attention mechanism works in the seq2seq setting (~ 5-10 lines)
<span style="color:green">
The attention mechanism lets the decoder focus on a specific part of a long input sequence when predicting the token at a given timestep. Instead of relying solely on the final $h_{T}^{enc}$ to predict the whole decoded sequence, it recombines and weighs all the $h_{t}^{enc}$ at each decoding step to focus on those relevant to the prediction.
At each decoding step, a scalar product is computed between $h_{t}^{dec}$ and all the $h_{t}^{enc}$. This gives a similarity measure between $h_{t}^{dec}$ and each $h_{t}^{enc}$. A softmax is applied to this vector to rescale the similarity coefficients so that they sum to $1$. These coefficients are then used to compute a weighted mean of the $h^{enc}$ vectors, which lets the network focus on some input tokens by making their coefficients much larger than the others. A common variant then applies a $tanh$ operation to this mean vector to restrict the input space of the next operation; the DecoderRNN above skips this step and feeds the mean vector directly to the final fully connected layer, followed by a (log-)softmax that predicts the next decoded token. Applying the attention mechanism means repeating this for each decoding timestep.
</span>
- 2) Compare the performance of your model at inference time with and without the attention mechanism. Do you see noticeable differences? Why?
<span style="color:green">
In this example, there is no noticeable difference in performance between the $2$ implementations, with and without the attention mechanism. A quick visualization also shows that the network does not really focus on a particular part of the input when predicting $1$ decoded token at a time. The reason is that this encoding-decoding problem is specific in that almost all input tokens are involved in producing each output token.
</span>
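
One way to make such a comparison less dependent on eyeballing the printed samples would be to compute simple metrics over decoded and expected strings. The sketch below is a hypothetical helper, not part of the notebook; the three pairs are copied from the first rows of the "no attention - teacher forcing" output above purely to show the call, and are far too few to support any conclusion.

```python
def exact_match_accuracy(pairs):
    """Fraction of (decoded, expected) pairs that are identical strings."""
    return sum(decoded == expected for decoded, expected in pairs) / len(pairs)


def char_accuracy(pairs):
    """Fraction of character positions that agree, counted over the longer string."""
    matches = total = 0
    for decoded, expected in pairs:
        matches += sum(d == e for d, e in zip(decoded, expected))
        total += max(len(decoded), len(expected))
    return matches / total


# (decoded, expected) pairs taken from the printed evaluation output above
pairs = [("117", "119"), ("-371", "-349"), ("2.05000", "1.97744")]
exact = exact_match_accuracy(pairs)
per_char = char_accuracy(pairs)
print(exact, per_char)
```

Running both metrics over the whole validation set for each of the four configurations would give a firmer basis for the comparison than a handful of printed rows.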
