# <center>Sequence-to-Sequence Learning with Neural Networks</center>

Throughout this notebook, we will build a **chatbot** using a sequence-to-sequence network.

In [17]:
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

1. [What is a sequence to sequence network ?](#sec1)


# 1. <a id="sec1"></a>What is a Sequence-to-Sequence model ?

<i>Sequence-to-Sequence</i> (abbr. Seq2Seq) models are deep learning models that take a sequence of items (sentences, medical signals, speech waveforms, time series, …) and output another sequence of items, hence the name "sequence to sequence".

<video width="852" height="480" controls src="Images/seq2seq_1.mp4" />

These models are explained in the two pioneering papers : [Sutskever et al., 2014](http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) and [Cho et al., 2014](http://emnlp2014.org/papers/pdf/EMNLP2014179.pdf).

Sequence-to-sequence models have proven effective for many tasks, in particular machine translation, text summarization, image captioning, and speech recognition.

In the case of Neural Machine Translation, the input is a series of words, and the output is the translated series of words. Until 2014, [Statistical Machine Translation](https://en.wikipedia.org/wiki/Statistical_machine_translation) was by far the most widely studied machine translation method. The introduction of [Neural Machine Translation](https://en.wikipedia.org/wiki/Neural_machine_translation) has significantly increased performance; for instance, in November 2016 Google introduced its new neural system, [Google Neural Machine Translation](https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation), for Google Translate.

<video width="852" height="480" controls src="Images/seq2seq_mt.mp4" />

# 2. <a id="sec2"></a>How is it made ?

A seq2seq network is made of two neural networks. The first one is an **encoder**, which encodes a variable-length input sequence into a fixed-length context vector (we will talk about this context vector later). The second one is a **decoder**, which receives this context vector and produces the output sequence.

<video width="852" height="480" controls src="Images/seq2seq_2.mp4" />

Note that the input is **3** circles, and the output is **4** triangles. 

Indeed, in translation for example, the length of the input sequence in language A is not necessarily equal to the length of the output sequence in language B. "Je suis étudiant" becomes "I am a student". A seq2seq model is able to take a variable-length sequence as an input and return a variable-length sequence as an output, using a fixed-size model: it encodes many inputs into one vector, then decodes that one vector into many outputs. The seq2seq model frees us from constraints on sequence length, which makes it ideal for translation between two languages and opens a whole new range of problems that can now be solved with such an architecture.

The encoder and decoder neural networks are generally **RNNs (Recurrent Neural Networks)**.
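
The idea of "many inputs into one vector, one vector into many outputs" can be sketched with two small GRUs. This is only an illustration under assumptions: the sizes are arbitrary, random tensors stand in for word embeddings, and the decoder runs for a fixed number of steps, whereas a real model would stop when it emits an end-of-sentence token.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8  # illustrative size, not a value taken from this notebook
encoder = nn.GRU(input_size=hidden_size, hidden_size=hidden_size)
decoder = nn.GRU(input_size=hidden_size, hidden_size=hidden_size)

def encode(seq_len):
    # Random stand-ins for word embeddings, shape (seq_len, batch=1, hidden_size)
    inputs = torch.randn(seq_len, 1, hidden_size)
    _, context = encoder(inputs)  # final hidden state plays the role of the context vector
    return context

def decode(context, n_steps):
    # Start from a zero input and feed the decoder its own previous output
    step_input = torch.zeros(1, 1, hidden_size)
    hidden = context
    outputs = []
    for _ in range(n_steps):
        step_output, hidden = decoder(step_input, hidden)
        outputs.append(step_output)
        step_input = step_output
    return torch.cat(outputs, dim=0)

context_short = encode(seq_len=3)
context_long = encode(seq_len=10)
decoded = decode(context_short, n_steps=4)

# The context has the same shape whatever the input length
print(context_short.shape, context_long.shape)
# The output length (4) is independent of the input length (3)
print(decoded.shape)
```

Whatever the input length, the context vector keeps the same shape, and the number of decoding steps is chosen independently of the number of encoding steps.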

### Why do we use RNNs ?

Remember, as we saw with Denis in the Time-series Forecasting notebook, recurrent neural networks depend on the previous state for the current state's computation. Instead of simply predicting $Y = f(x)$ as in feed-forward neural networks, recurrent networks compute $Y_1 = f(x_1, f(x_0))$.

<img src="Images/unrolled_rnn.png"/>

You can see that at every point in time, it takes as input its own previous state and the new input at that time step.
<div class="alert alert-success">
RNNs remember their previous state.
</div>

Each state is a function of the previous state, which is itself a function of its previous state, and so on. So, state n contains information from all past timesteps. And we need this to predict sequences. Indeed, the elements of a sequence and the order of these elements are strongly related. For instance, a sentence contains words in a certain order, and some of them have a strong influence on others.
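
As a rough illustration of this dependency, the sketch below unrolls a small `nn.RNNCell` by hand and changes only the first input of a sequence. The sizes and the helper `final_state` are assumptions made for the example, not part of the chatbot model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

cell = nn.RNNCell(input_size=4, hidden_size=6)  # illustrative sizes

def final_state(inputs):
    # Apply h_t = f(x_t, h_{t-1}) step by step, starting from a zero state
    h = torch.zeros(1, 6)
    for x_t in inputs:
        h = cell(x_t.unsqueeze(0), h)
    return h

sequence = torch.randn(3, 4)  # 3 time steps, 4 features each
changed = sequence.clone()
changed[0] += 1.0             # modify only the first time step

with torch.no_grad():
    h_original = final_state(sequence)
    h_changed = final_state(changed)

# Largest difference between the two final states
diff = (h_original - h_changed).abs().max().item()
print(diff)
```

A non-zero difference means that the last state still carries information about the very first input.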

For example, let's use Google Translate to translate "Je surveille mes actions tous les jours" into English :

<img src="Images/mt_ex_1.png"/>

Now, let's translate "A la bourse, je surveille mes actions tous les jours" :

<img src="Images/mt_ex_2.png"/>

**To recap**, here's an animation showing what our seq2seq machine translation model looks like so far.

<video width="852" height="480" controls src="Images/seq2seq_4.mp4" />

In [19]:
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, :, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden

In theory, the context vector (the final hidden state of the encoder) will contain semantic information about the query sentence that is input to the bot. 
**Problem**: The context vector is responsible for representing the entire input sequence. The output sequence relies heavily on this vector, making it challenging for the model to deal with long sentences. A solution was proposed in the papers [Bahdanau et al., 2014](https://arxiv.org/pdf/1409.0473.pdf) and [Luong et al., 2015](https://arxiv.org/pdf/1508.04025.pdf) : **attention**.

### Attention

Attention allows the model to focus on the relevant parts of the input sequence at every step of the output sequence, so that the context is preserved from beginning to end.

Instead of sending a single hidden state vector to the decoder, we send an attention vector created from a linear combination of all the encoder's hidden states.

<video width="852" height="480" controls src="Images/seq2seq_5.mp4" />

At every step, the decoder can select a different part of the input sentence to consider. So, at every step, a new attention vector (that is, a new linear combination) is computed to be relevant for the decoder.

<video width="852" height="480" controls src="Images/seq2seq_6.mp4" />
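
One decoding step of attention can be sketched as follows. The notebook does not say which scoring function it uses, so this example assumes the simple "dot" score from Luong et al.; the sizes are arbitrary and random tensors stand in for real encoder outputs and decoder state.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

src_len, hidden_size = 5, 8                          # illustrative sizes
encoder_outputs = torch.randn(src_len, hidden_size)  # one vector per input word
decoder_hidden = torch.randn(hidden_size)            # current decoder state

# Dot-product score between the decoder state and each encoder output
scores = encoder_outputs @ decoder_hidden            # shape (src_len,)
# Softmax turns the scores into positive weights that sum to 1
attn_weights = F.softmax(scores, dim=0)
# The attention vector is the weighted sum (linear combination) of encoder outputs
context = attn_weights @ encoder_outputs             # shape (hidden_size,)

print(attn_weights)
print(context.shape)
```

Because `decoder_hidden` changes at every output step, the weights, and therefore the linear combination, are recomputed each time.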

As an example, for the translation of the sentence "L'accord sur la zone économique européenne a été signé en août 1992", we can visualize the attention matrix.

<img width="480" height="360"  src="Images/attention.png"/>

You can see how the model, for unambiguous words like "août" or "1992", gives little importance to the other words in the sentence. You can also see that for "zone économique européenne", the model adapts to the reversed order between French and English.

# 3. <a id="sec3"></a>Preprocess the data for our chatbot

In [18]:
#%run -i 'preprocessing.ipynb'

# 4. <a id="sec4"></a>Training the model

# 5. <a id="sec5"></a>Plotting the results

**Visualizing attention**

**TO SAY**

Since the task is sequence based, both the encoder and decoder tend to use some form of RNNs, LSTMs, GRUs, etc. 

Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems. Likewise, question answering can also be seen as mapping a sequence of words representing the question to a sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful. Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed.

We use a GRU (Gated Recurrent Unit), which is the subject of one of AML's notebooks this year (topic number 31), if you want to learn more about it.

We will use a bidirectional variant of the GRU, meaning that there are essentially two independent RNNs: one that is fed the input sequence in normal sequential order, and one that is fed the input sequence in reverse order. The outputs of each network are summed at each time step. Using a bidirectional GRU will give us the advantage of encoding both past and future context.
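
The summing of the two directions can be checked on a bare `nn.GRU`. This is a minimal sketch with arbitrary sizes, where a random tensor stands in for an embedded batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size = 8  # illustrative size
gru = nn.GRU(hidden_size, hidden_size, bidirectional=True)

# Stand-in for an embedded batch: (seq_len=6, batch=2, hidden_size)
embedded = torch.randn(6, 2, hidden_size)
outputs, hidden = gru(embedded)

# The last dimension holds [forward features | backward features]
print(outputs.shape)  # torch.Size([6, 2, 16])

forward_out = outputs[:, :, :hidden_size]
backward_out = outputs[:, :, hidden_size:]
summed = forward_out + backward_out

print(summed.shape)   # torch.Size([6, 2, 8])
print(hidden.shape)   # torch.Size([2, 2, 8]): one final state per direction
```

Summing rather than concatenating keeps the output width equal to `hidden_size`, so later layers do not need to know that the GRU was bidirectional.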

This is important for a chatbot, unlike machine translation. For example, if your input is "Are you a student ?", machine translation can deal with only the past context. 

But you’ll use one trick you might never have seen before. In deep networks like this one, you need to keep the gradients from growing too large, so that a single update doesn’t change the weights too dramatically. This technique is called gradient clipping.
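
In PyTorch, one common way to do this is `torch.nn.utils.clip_grad_norm_`, called between `loss.backward()` and the optimizer step. The sketch below is an assumption about how it could be used here: the model is a stand-in GRU, the threshold `clip` is a hypothetical value, and the loss is artificially inflated to produce large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.GRU(8, 8)  # stand-in model, sizes are illustrative
clip = 1.0            # hypothetical threshold, not a value from this notebook

inputs = torch.randn(5, 1, 8)
outputs, _ = model(inputs)
loss = (outputs * 1000).pow(2).sum()  # artificially large loss, hence large gradients
loss.backward()

# Rescale all gradients in place so that their combined L2 norm is at most `clip`;
# the call returns the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```

Clipping by norm rescales every gradient by the same factor, so the direction of the update is preserved and only its size is limited.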