Source: https://huggingface.co/learn/nlp-course/chapter1/7?fw=pt

# Sequence-to-sequence models

Encoder-decoder models (also called <i>sequence-to-sequence models</i>) <span style="color:blue">use both parts of the Transformer architecture. At each stage, <b>the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input</b>.</span>
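
The difference in what each side can "see" can be sketched as two boolean attention masks. This is a minimal illustration in plain Python, not code from any library; the helper names `encoder_mask` and `decoder_mask` are made up for this note, and the decoder mask follows the common convention in which a position may also attend to itself.

```python
def encoder_mask(n):
    """Full (bidirectional) mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def decoder_mask(n):
    """Causal mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


tokens = ["Welcome", "to", "NYC"]
masks = [("encoder", encoder_mask(len(tokens))),
         ("decoder", decoder_mask(len(tokens)))]

for name, mask in masks:
    print(name)
    for tok, row in zip(tokens, mask):
        # Keep only the tokens this position is allowed to look at.
        visible = [t for t, allowed in zip(tokens, row) if allowed]
        print(f"  {tok!r} attends to {visible}")
```

In the encoder every word sees the whole sentence, while in the decoder each word sees only what came before it (plus itself).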

<span style="color:blue">The <b>pretraining</b> of these models can be done using the <b>objectives of encoder or decoder models</b>, but usually involves something a bit more complex. For instance, [T5](https://huggingface.co/t5-base) is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces.</span>
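
The span-masking objective described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the function name `corrupt_spans` is hypothetical, the spans are picked by hand rather than sampled at random, and the sentinel names follow the `<extra_id_N>` pattern used by the T5 tokenizer.

```python
def corrupt_spans(words, spans):
    """Replace each span with one sentinel and build the matching target.

    `spans` is a sorted list of non-overlapping, half-open (start, end) ranges.
    Returns (corrupted_input, target) as strings.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(words[cursor:start])  # keep the text before the span
        inputs.append(sentinel)             # one mask word replaces the whole span
        targets.append(sentinel)            # the target names the mask ...
        targets.extend(words[start:end])    # ... followed by the text it hides
        cursor = end
    inputs.extend(words[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return " ".join(inputs), " ".join(targets)


words = "Thank you for inviting me to your party last week".split()
src, tgt = corrupt_spans(words, [(2, 4), (8, 9)])
print(src)  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(tgt)  # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The encoder reads the corrupted input, and the decoder is trained to produce the target, that is, only the text hidden behind each mask word.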

Sequence-to-sequence models are <span style="color:blue">best suited for tasks revolving around generating new sentences depending on a given input, such as <b>summarization</b>, <b>translation</b>, or <b>generative question answering</b>.</span>

Representatives of this family of models include:

- [BART](https://huggingface.co/transformers/model_doc/bart.html)
- [mBART](https://huggingface.co/transformers/model_doc/mbart.html)
- [Marian](https://huggingface.co/transformers/model_doc/marian.html)
- [T5](https://huggingface.co/t5-base)
- ProphetNet
- mT5
- Pegasus
- M2M100

<img src="images/Encoder-Decoder-Models-1.png" style="width:500px;" title="encoder-decoder models">

<b>Encoder:</b>
<span style="color:blue">Generates a numerical representation for each input word; this representation carries information about the meaning of the sequence.</span>

<b>Decoder:</b>
<span style="color:blue">The outputs of the encoder are passed directly to the decoder. In addition to the encoder outputs, the decoder also receives a sequence as input. When prompting the decoder for an output with no initial sequence, we can give it a special value that indicates the "start of a sequence".</span>

<b>Encoder-Decoder:</b>
<span style="color:blue">In summary, the encoder accepts an initial sequence as input and outputs a numerical representation of it. It has, in a sense, encoded the sequence. It then sends that representation over to the decoder for decoding. Since the numerical representations have already been generated, the encoder only needs to run once and is not needed again for that input. The decoder, using this representation alongside its usual sequence input, takes a stab at decoding the sequence and outputs a word. The "start of sequence" word tells the decoder that it should start decoding.</span>

<img src="images/Encoder-Decoder-Models-2.png" style="width:600px;" title="encoder-decoder models">
<img src="images/Encoder-Decoder-Models-3.png" style="width:600px;" title="encoder-decoder models">

<span style="color:blue">Now that we have both the encoder's numerical representation (feature vector) and an initial generated word, we don't need the encoder anymore. The decoder acts in an <b>auto-regressive manner</b>, using the word it has just output as its next input. Combined with the numerical representations output by the encoder, this input is used to generate the second word. We continue in this way until the decoder outputs a value that we treat as a stopping value (e.g., ".").</span>

<img src="images/Encoder-Decoder-Models-4.png" style="width:600px;" title="encoder-decoder models">

<b>Example: Translation Language Model (or Transduction)</b>
    
Step 1: Use the encoder to generate a numerical representation of the English sentence (e.g., "Welcome to NYC").

Step 2: Pass the encoded numerical representation to the decoder along with the "start of sequence" word, and ask it to output the first word (e.g., "Bienvenue"). Once the decoder has output the first word, that word is used as its input sequence alongside the encoder's numerical representation, allowing the decoder to predict the second word (e.g., "à"). Finally, we ask the decoder to predict the third word (e.g., "NYC").

<img src="images/Encoder-Decoder-Models-5.png" style="width:600px;" title="encoder-decoder models">
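
The two steps above can be sketched as a small greedy decoding loop. Everything here is a stand-in: `toy_encoder`, `toy_decoder`, the lookup table, and the `<s>` / `</s>` markers are hypothetical names for illustration, and the "decoder" is a hard-coded table rather than a trained model. The point is only the control flow: the encoder runs once, and the decoder is called repeatedly on its own previous outputs.

```python
BOS, EOS = "<s>", "</s>"  # hypothetical start-of-sequence and stop markers


def toy_encoder(sentence):
    """Stand-in for a real encoder: one 'representation' per input word."""
    return [("repr", word) for word in sentence.split()]


# Stand-in for a trained decoder: words generated so far -> next word.
TOY_NEXT_WORD = {
    (BOS,): "Bienvenue",
    (BOS, "Bienvenue"): "à",
    (BOS, "Bienvenue", "à"): "NYC",
    (BOS, "Bienvenue", "à", "NYC"): EOS,
}


def toy_decoder(encoder_outputs, generated):
    """Return the next word given the encoder outputs and the words so far.

    A real decoder would attend to `encoder_outputs`; this toy only checks
    that they were provided and then consults the lookup table.
    """
    assert encoder_outputs, "the decoder always receives the encoder outputs"
    return TOY_NEXT_WORD.get(tuple(generated), EOS)


def translate(sentence, max_steps=10):
    encoder_outputs = toy_encoder(sentence)  # step 1: the encoder runs once
    generated = [BOS]                        # step 2: start-of-sequence word
    for _ in range(max_steps):
        next_word = toy_decoder(encoder_outputs, generated)
        if next_word == EOS:                 # stopping value reached
            break
        generated.append(next_word)          # fed back in on the next step
    return " ".join(generated[1:])           # drop the start marker


print(translate("Welcome to NYC"))  # Bienvenue à NYC
```

Note that the loop, not the input, decides how many words are produced, which is why the output length does not have to match the input length.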

<span style="color:blue"><b>Encoder-decoder models</b> shine because they have both an encoder and a decoder</span>, and weights are not necessarily shared between the two. In addition, the output length can be independent of the input length, because the encoder and decoder are separate.
- <span style="color:blue"><b>Encoder</b>: Understands the sequence, extracts the relevant information, and packs it into an information-dense vector</span>
- <span style="color:blue"><b>Decoder</b>: Its sole purpose is to decode the numerical representation output by the encoder. <b>It can specialise in a completely different language, <i>or even a different modality, such as images or speech</i></b></span>

<b>When should I use a sequence-to-sequence model?</b>
<img src="images/Encoder-Decoder-Models-6.png" style="width:500px;" title="encoder-decoder models">

<b>Example: Translation</b>
<img src="images/Encoder-Decoder-Models-7.png" style="width:600px;" title="encoder-decoder models">

<b>Example: Summarization</b>
<img src="images/Encoder-Decoder-Models-8.png" style="width:700px;" title="encoder-decoder models">

> <span style="color:blue">Additionally, <b>we can load an encoder and a decoder inside an encoder-decoder model</b>. Depending on the specific task we are targeting, we may therefore choose specific encoders and decoders that have proven to work well on that task.</span>

<img src="images/Encoder-Decoder-Models-9.png" style="width:700px;" title="encoder-decoder models">
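
One way to try this pairing is the `EncoderDecoderModel` class in the 🤗 Transformers library. The sketch below is not part of the original notes: the wrapper function `build_encoder_decoder` is a hypothetical helper, the checkpoint names are just example defaults, and calling it assumes `transformers` and `torch` are installed and that the weights can be downloaded.

```python
def build_encoder_decoder(encoder_name="bert-base-uncased",
                          decoder_name="bert-base-uncased"):
    """Combine two pretrained checkpoints into one encoder-decoder model.

    Needs `pip install transformers torch`; downloads weights on first call.
    """
    # Imported lazily so that defining this function has no heavy dependencies.
    from transformers import EncoderDecoderModel

    # Loads the first checkpoint as the encoder and the second as the decoder.
    model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )
    return model


# Example call (commented out because it downloads model weights):
# model = build_encoder_decoder()
```

A model assembled this way generally still needs fine-tuning on the target task before its outputs are useful, since the two halves were not trained together.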