# Encoders and Decoders in Sequence to Sequence Models and the Purpose of Attention Mechanisms
In Sequence to Sequence models the input is a sequence of elements (e.g. words) and the output is also a sequence of elements.<br>

Machine translation example<br>
<img src="images/AM1.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

Generally, the **encoder** turns the whole input into a single vector called the **Context Vector**, which holds what the encoder was able to capture from the input sequence. That vector is then passed to the **decoder**, which produces the output sequence.<br><br><br>
**Example**: In Machine Translation, the encoder and decoder are both RNNs, often with LSTM cells. The whole process looks as follows (image above):<br><br>
Let's assume that `word1 = comment`, `word2 = allez`, `word3 = vous`:<br><br>
We process `word1` with the Encoder, which outputs `HiddenState1`. Then, using `word2` and `HiddenState1` as inputs, we produce `HiddenState2`. After that we take `word3` and `HiddenState2` as inputs, process them with the Encoder and get `HiddenState3`, which becomes our **Context Vector**. The Context Vector `(HiddenState3)` is the input for the Decoder, which produces an output consisting of 3 words translated into the target language.<br>
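
The recurrence described above can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming a simple tanh RNN cell rather than an LSTM; the function names, the toy word vectors, and the weight values are all made up for the example.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh):
    """One simple RNN step: h = tanh(W_xh x + W_hh h_prev)."""
    size = len(h_prev)
    return [
        math.tanh(
            sum(w_xh[i][j] * x[j] for j in range(len(x)))
            + sum(w_hh[i][j] * h_prev[j] for j in range(size))
        )
        for i in range(size)
    ]

def encode(inputs, w_xh, w_hh, hidden_size):
    """Run the encoder over the sequence and return every hidden state."""
    h = [0.0] * hidden_size  # assumed all-zero initial hidden state
    hidden_states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh)
        hidden_states.append(h)
    return hidden_states

# Toy 2-dimensional vectors standing in for "comment", "allez", "vous".
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_xh = [[0.5, -0.3], [0.1, 0.8]]  # input-to-hidden weights (hypothetical)
w_hh = [[0.2, 0.0], [0.0, 0.2]]   # hidden-to-hidden weights (hypothetical)

hidden_states = encode(words, w_xh, w_hh, hidden_size=2)
context_vector = hidden_states[-1]  # HiddenState3 in the text
```

Note that `context_vector` has the same length no matter how many words are fed in, which is exactly the property discussed next.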

This approach has one serious limitation. The size of our **Context Vector** is constant, regardless of the length of the input sequence. One could say that enlarging the Context Vector would address this problem, but then with short input sequences the problem of **overfitting** occurs. To solve that problem, **Attention Mechanisms** were introduced!

# Sequence to Sequence model with Attention
 
Attention in Encoder<br>
<img src="images/AM2.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

The process looks very similar to the previous one, but here, instead of taking only the last hidden state, we take all of them. Then at each timestep the **Attention Decoder** decides which of those vectors to focus on, so it still processes them one by one but not necessarily in sequential order. During training the Decoder learns which vector it should focus on next. A very interesting example is presented below.<br>

Translation of sentence in French to English<br>
<img src="images/AM3.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

**Assumptions**: The behaviour shown above comes from a **trained model**. It should be understood as follows.<br>

For each word in the pink box, the squares in the corresponding column show how much attention was paid to the words in the green box (the brighter the square, the more attention was paid). So in order to get the second word `agreement`, the Decoder paid great attention to the vector representing the word `accord` and a little attention to the word `L'`. What is interesting is that until the fourth word `la`, the process went sequentially. Then the Decoder decided to translate the word `europeenne`, so it jumped over 2 words! That was because of the difference in word order between those two languages. The Decoder with attention was able to capture that dependency (that is amazing in my personal opinion).<br>

### Attention Encoder

Commonly, the Attention Encoder is one of the following: a simple RNN, an LSTM or a GRU. Before words can actually be treated as inputs, they have to pass through an **[embedding layer](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#word-embeddings-in-pytorch)** which turns every word into a vector of length `n`.<br>
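
Conceptually, an embedding layer is a lookup table with one row of length `n` per vocabulary word. The sketch below uses plain Python instead of the PyTorch layer linked above, and fills the table with random numbers; in a real model those rows are learned. The helper names and the tiny vocabulary are hypothetical.

```python
import random

def build_embedding(vocab, n, seed=0):
    """Create a word-to-index map and a table with one length-n row per word."""
    rng = random.Random(seed)
    word_to_index = {word: i for i, word in enumerate(vocab)}
    # Random values stand in for weights that would be learned during training.
    table = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in vocab]
    return word_to_index, table

def embed(sentence, word_to_index, table):
    """Replace every word with its row of the embedding table."""
    return [table[word_to_index[word]] for word in sentence]

vocab = ["comment", "allez", "vous"]
word_to_index, table = build_embedding(vocab, n=4)
vectors = embed(["comment", "allez", "vous"], word_to_index, table)
```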

Attention Encoder flow<br>
<img src="images/AM4.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

We pass each embedded vector through the RNN, which sequentially produces a hidden state for each word in our sequence, the same way as described above. After that, we pass those hidden states to the Decoder.<br>

### Attention Decoder

Attention Decoder flow<br>
<img src="images/AM5.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

Now, the Decoder has a **scoring function** (described in more detail later) which produces a score for each hidden state vector (score == importance of the vector in translation). Having a score for each vector, we apply the softmax function to obtain a probability distribution. With that we can compute the **Context Vector (c)**, which is simply a weighted average of all hidden state vectors, where the weights are the softmax scores.<br><br>
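
The softmax and weighted-average steps can be sketched as follows. The scores here are made-up numbers, since the scoring function itself is only introduced later; the function names are illustrative as well.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(hidden_states, scores):
    """Weighted average of the hidden states, weighted by softmaxed scores."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [
        sum(w * h[k] for w, h in zip(weights, hidden_states))
        for k in range(dim)
    ]

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder outputs
scores = [2.0, 0.5, 0.5]                              # hypothetical scores
weights = softmax(scores)
c = context_vector(hidden_states, scores)
```

The first hidden state gets the largest weight because it has the largest score, so `c` is pulled towards it.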

So now, to produce the first hidden state of the Decoder and the first output, we take the `Context Vector` and the `<END> token` as inputs. After that, we pass the hidden state to the next cell, whose input is the output of the previous cell (*how* is the input of the cell when we are predicting *are*).<br>

#### Multiplicative Attention
It's time to say more about that mysterious **scoring function**.<br>

$score(h_{t}, \bar{h}_{s}) = h_{t}^{T} \circ \bar{h}_{s}$, where<br>
$h_{t}$ - hidden state of the Decoder at timestep $t$<br>
$\bar{h}_{s}$ - matrix of hidden states that came from the Encoder<br>
$\circ$ - dot product operation<br>

It is worth knowing what exactly the dot product of two vectors is.<br><br>
$x \circ y = |x||y|\cos(\alpha)$<br><br>
When two vectors have the same length, the bigger the dot product, the more similar the vectors are. That is because as the angle approaches 0 degrees (vectors pointing in the same direction), the value of $\cos(\alpha)$ increases up to 1 (when the angle is 0), so the dot product increases as well. The image below shows that perfectly.
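
A quick numeric check of this relationship, using three unit-length vectors chosen for illustration at 0, 45 and 90 degrees from a reference vector:

```python
import math

def dot(x, y):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean length |x|."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """cos(alpha), recovered from x . y = |x||y|cos(alpha)."""
    return dot(x, y) / (norm(x) * norm(y))

reference = [1.0, 0.0]
same_direction = [1.0, 0.0]                  # angle of 0 degrees
diagonal = [math.sqrt(0.5), math.sqrt(0.5)]  # unit length, 45 degrees away
orthogonal = [0.0, 1.0]                      # 90 degrees away

for v in (same_direction, diagonal, orthogonal):
    print(round(dot(reference, v), 3), round(cosine(reference, v), 3))
```

Because every vector here has length 1, the dot product and the cosine coincide, and both shrink from 1 towards 0 as the angle grows from 0 to 90 degrees.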

Multiplicative Attention<br>
<img src="images/AM6.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

The example above works only when both hidden states live in an **embedding space** of the same size. Sometimes that is not the case and the sizes differ. To address this problem there is the **general** approach: we use a linear transformation to match the size of the Decoder hidden state to the size of the Encoder hidden states.<br>

$score(h_{t}, \bar{h}_{s}) = h_{t}^{T} W_{a} \bar{h}_{s}$, where<br>
$h_{t}$ - hidden state of the Decoder at timestep $t$<br>
$\bar{h}_{s}$ - matrix of hidden states that came from the Encoder<br>
$W_{a}$ - weight matrix which serves as a linear transformation and is trained with the model<br>
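
A minimal sketch of the general score for a Decoder state of size 2 and Encoder states of size 3. The sizes and the values of `w_a` are picked by hand for illustration; in a real model `w_a` is learned.

```python
def general_score(h_t, w_a, h_s):
    """score = h_t^T W_a h_s for a single encoder hidden state h_s."""
    # W_a maps the encoder state into the decoder's space ...
    projected = [
        sum(w_a[i][j] * h_s[j] for j in range(len(h_s)))
        for i in range(len(h_t))
    ]
    # ... where an ordinary dot product with h_t becomes possible.
    return sum(a * b for a, b in zip(h_t, projected))

h_t = [1.0, 2.0]                                     # decoder state, size 2
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # encoder states, size 3
w_a = [[1.0, 0.0, 0.5],                              # 2 x 3 weight matrix
       [0.0, 1.0, 0.5]]                              # (hypothetical values)

scores = [general_score(h_t, w_a, h_s) for h_s in encoder_states]
```

The shape of `w_a` (decoder size by encoder size) is what lets the two differently sized vectors be compared.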

Knowing all that, we can look in more depth at the Decoder's behaviour at timestep $t$. Before that, some notation has to be introduced.<br>

Multiplicative Attention Decoding<br>
<img src="images/AM7.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>

$h_{init}$ - initial hidden state vector of Decoder<br>
$end$ - end-of-sequence symbol<br>
$h_{t}$ - hidden state of the Decoder at timestep $t$<br>
$\bar{h}_{s}$ - matrix of hidden states that came from the Encoder<br>
$n$ - number of words in input sequence<br>
$c_{t}$ - context vector at timestep $t$<br>
$in_{t}$ - input at timestep $t$<br>
$out_{t}$ - output at timestep $t$<br>
$s_{t}^{i}$ - score vector at timestep $t$ where $i$-th element corresponds with $i$-th column in $\bar{h}_{s}$

1. $h_{t-1}$ ($h_{init}$ during the first Decoder timestep) and the Decoder input $in_{t}$ ($end$ during the first Decoder timestep) are processed in an RNN/LSTM/GRU cell, producing $h_{t}$
2. The score is computed, $score(h_{t}, \bar{h}_{s}) = h_{t}^{T} \circ \bar{h}_{s} = [s_{t}^{1}, s_{t}^{2}, s_{t}^{3}, \ldots]$, then the softmax function is applied to the $s_{t}$ vector, and finally the context vector is computed from the softmaxed scores $$c_{t} = \sum_{i=1}^{n} s_{t}^{i} \bar{h}_{s}^{i}$$
3. We **concatenate** $c_{t}$ and $h_{t}$ and put the result through a fully-connected layer with a $tanh$ activation function, which produces $out_{t}$
4. Lastly, $out_{t}$ becomes $in_{t+1}$ and the loop starts again for timestep $t+1$
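
Steps 2 and 3 above can be sketched in plain Python. The sketch assumes $h_{t}$ has already been produced by the recurrent cell in step 1; the function name `attention_step`, the weight matrix `w_c` of the fully-connected layer, and all numeric values are hypothetical.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_step(h_t, encoder_states, w_c):
    """Steps 2-3: score, softmax, context vector, concatenate, tanh layer."""
    # Step 2: one dot-product score per encoder hidden state, then softmax.
    scores = [dot(h_t, h_s) for h_s in encoder_states]
    weights = softmax(scores)
    c_t = [
        sum(w * h_s[k] for w, h_s in zip(weights, encoder_states))
        for k in range(len(encoder_states[0]))
    ]
    # Step 3: concatenate [c_t; h_t] and apply a fully-connected tanh layer.
    concat = c_t + h_t
    out_t = [math.tanh(dot(row, concat)) for row in w_c]
    return out_t, weights, c_t

h_t = [1.0, 0.0]                                       # decoder state (toy)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states
w_c = [[0.5, 0.0, 0.5, 0.0],                           # 2 x 4 layer weights
       [0.0, 0.5, 0.0, 0.5]]                           # (hypothetical)

out_t, weights, c_t = attention_step(h_t, encoder_states, w_c)
```

In a full decoder loop, `out_t` would then be fed back as the next input, as in step 4.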

#### Additive Attention
Basically, this approach differs only in the form of the scoring function. In the additive case it looks as follows:<br>

$score(h_{t}, \bar{h}_{s}) = v_{a}^{T}tanh(W_{a}[h_{t};\bar{h}_{s}])$, where<br>
$h_{t}$ - hidden state of the Decoder at timestep $t$<br>
$\bar{h}_{s}$ - matrix of hidden states that came from the Encoder<br>
$W_{a}, v_{a}^{T}$ - weights of the feedforward neural net that produces the score ($W_{a}$ is a matrix, $v_{a}$ is a vector)<br>
$[x;y]$ - concatenation operation of vectors $x$ and $y$<br>

Additive Attention scoring function<br>
<img src="images/AM8.png" width="700"><br>
Credits: Udacity Computer Vision Nanodegree<br>
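
A minimal sketch of the additive scoring function for a single Encoder hidden state. The sizes and the values of `w_a` and `v_a` are chosen by hand for illustration; in a real model both are learned together with the rest of the network.

```python
import math

def additive_score(h_t, h_s, w_a, v_a):
    """score = v_a^T tanh(W_a [h_t; h_s]) for one encoder hidden state h_s."""
    concat = h_t + h_s  # [h_t; h_s]
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, concat)))
        for row in w_a
    ]
    return sum(v * h for v, h in zip(v_a, hidden))

h_t = [1.0, 0.0]                                       # decoder state (toy)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy encoder states
w_a = [[0.5, 0.0, 0.5, 0.0],                           # 3 x 4 weight matrix
       [0.0, 0.5, 0.0, 0.5],                           # (hypothetical values)
       [0.1, 0.1, 0.1, 0.1]]
v_a = [1.0, -1.0, 0.5]                                 # hypothetical vector

scores = [additive_score(h_t, h_s, w_a, v_a) for h_s in encoder_states]
```

From here the scores are used exactly as in multiplicative attention: softmax, then a weighted sum of the Encoder hidden states.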