## Sequence to Sequence modelling notebook

In [1]:
from lib.initialize import *

<div class=width>

### 1. Sequence to sequence modelling tasks

Sequence to sequence modelling tasks are machine learning tasks where both the input and the output are sequences.

Some examples:
<p align="left">
<img src="resources/assets/seq_to_seq_applications.png" alt="drawing" width="800" >
</p>

Let's first see some examples of sequence to sequence modelling problems.
</div>

<div class=width>

### Examples of sequence to sequence (seq2seq) tasks
Let's look at some concrete seq2seq tasks for illustration. 

#### 1. Shift a sequence
This is a toy example in which we shift a sequence to the right and pad with zeros on the left. For example, to shift an input by three steps,
$$\begin{align*}
\text{Input:} & \, 5,8,9,0,1,2,5,6 \\
\text{Output:} &\,  0,0,0,5,8,9,0,1
\end{align*}$$
Shifting the input by $k$ units is a linear relation, equivalent to convolving the input with a delta impulse,
$$\begin{align*}
y(t) = \sum_s \delta(s-k)x(t-s)
\end{align*}$$
where 
$$
\delta(s-k) = \begin{cases}
                    1 & s = k \\
                    0 & \text{else}
                \end{cases}.
$$

This relationship is completely determined by the parameter $k$, which can also be regarded as the memory of the relationship because $y(t) = x(t-k)$. Thus, when $k$ is large, $y(t)$ depends on an input far in the past.
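
As a quick numerical check of the identity above, the following NumPy sketch shifts the example input by three steps and compares the result with a convolution against a delta filter. The helper names `shift_right` and `delta_filter` are illustrative and are not part of the notebook's library.

```python
import numpy as np

def shift_right(x, k):
    """Shift x right by k steps, padding with zeros on the left."""
    y = np.zeros_like(x)
    if k < len(x):
        y[k:] = x[:len(x) - k]
    return y

def delta_filter(k):
    """Filter with a single 1 at index k, i.e. delta(s - k) for s = 0..k."""
    rho = np.zeros(k + 1)
    rho[k] = 1.0
    return rho

x = np.array([5, 8, 9, 0, 1, 2, 5, 6], dtype=float)
k = 3

y_shift = shift_right(x, k)
# np.convolve computes sum_s rho(s) x(t - s); keep the first len(x) outputs
y_conv = np.convolve(x, delta_filter(k))[:len(x)]

print(y_shift)                          # [0. 0. 0. 5. 8. 9. 0. 1.]
print(np.array_equal(y_shift, y_conv))  # True
```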
 
In theory, RNNs do not perform well on this task, while CNNs perform very well.
>(See our paper [Approximation Theory of Convolutional Architectures for Time Series Modelling](https://proceedings.mlr.press/v139/jiang21d.html))


</div>

<style>
div.width {

    margin:auto;
    max-width: 1000px;
}
</style>

In [2]:
ShiftPlotter(k=25).plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

<div class=width>

#### 2. Convolution of a sequence
Convolution is one of the most basic operations that can be considered as a sequence to sequence task.
Suppose $\bm\rho$ is a convolution filter. Then the convolution of the input $\bm x$ with the filter is given by
$$\begin{align*}
y(t) = (\bm\rho \ast\bm x)(t) =\sum_s \rho(s)x(t-s).
\end{align*}$$
In this case the filter $\bm \rho$ determines the relationship.
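
To make the sum concrete, here is a minimal sketch that evaluates $y(t) = \sum_s \rho(s)x(t-s)$ with explicit loops, assuming $x(t) = 0$ for $t < 0$, and compares it with NumPy's built-in convolution. The function name `causal_conv` and the filter values are made up for illustration.

```python
import numpy as np

def causal_conv(rho, x):
    """y(t) = sum_s rho(s) x(t - s), with x(t) = 0 for t < 0."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for s in range(len(rho)):
            if t - s >= 0:
                y[t] += rho[s] * x[t - s]
    return y

rho = np.array([0.5, 0.3, 0.2])      # illustrative filter
x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative input
y = causal_conv(rho, x)

# Same result as NumPy's convolution truncated to the input length
y_ref = np.convolve(x, rho)[:len(x)]
print(np.allclose(y, y_ref))
```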

</div>






In [3]:
ConvoPlotter().plot()

VBox(children=(HBox(children=(HBox(children=(Text(value='0.002, 0.022, 0.097, 0.159, 0.097, 0.022, 0.002', des…

<div class=width>

#### 3. Lorenz96 System

Now let's look at a more complicated example, where the input-output relationship is determined by a nonlinear dynamical system.
  
  
The system has $K$ inputs $\{x_k\}$, $K$ outputs $\{y_k\}$ and $JK$ hidden variables $\{z_{j,k}\}$ with $k = 1, 2, \dots, K$ and $j = 1, 2, \dots, J$. The parameters $K, J$ control the number of variables in the system and can be viewed as a complexity measure.
The system satisfies the following dynamics
\begin{align*}
    \frac{dy_k}{dt} & = -y_{k-1}(y_{k-2}-y_{k+1})-y_k + {\color{green} x_k}  - \frac{1}{J}\sum_{j=1}^J z_{j,k},  \\
    \frac{dz_{j,k}}{dt} & = -z_{j+1,k}(z_{j+2,k}-z_{j-1,k})-z_{j,k} + y_k.
\end{align*}

Thus, given a set of inputs $\{x_k\}$, the system determines a set of outputs $\{y_k\}$.
The following plot shows an example with $K=1$, where we have one curve as input, and the system gives an output curve.
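
The dynamics above can be integrated numerically to turn an input curve into an output curve. The sketch below is only an illustration under stated assumptions: indices wrap around periodically in both $k$ and $j$, the state starts at zero, and a simple forward-Euler step with `dt=0.01` is used. None of these choices is confirmed by the notebook, and `lorenz96_rhs` and `simulate` are hypothetical names.

```python
import numpy as np

def lorenz96_rhs(y, z, x):
    """Right-hand side of the dynamics; y, x have shape (K,), z has shape (J, K).

    Assumption: indices wrap around periodically in both k and j.
    """
    dy = (-np.roll(y, 1) * (np.roll(y, 2) - np.roll(y, -1))
          - y + x - z.mean(axis=0))
    dz = (-np.roll(z, -1, axis=0)
          * (np.roll(z, -2, axis=0) - np.roll(z, 1, axis=0))
          - z + y[None, :])
    return dy, dz

def simulate(x_seq, J=10, dt=0.01):
    """Forward-Euler rollout from a zero initial state; x_seq has shape (T, K)."""
    T, K = x_seq.shape
    y, z = np.zeros(K), np.zeros((J, K))
    out = np.empty((T, K))
    for t in range(T):
        dy, dz = lorenz96_rhs(y, z, x_seq[t])
        y, z = y + dt * dy, z + dt * dz
        out[t] = y
    return out

t_grid = np.linspace(0.0, 2.0 * np.pi, 128)
x_seq = np.sin(t_grid)[:, None]   # one input curve, K = 1
y_seq = simulate(x_seq)           # one output curve, shape (128, 1)
```

A higher-order integrator such as Runge-Kutta would be the more usual choice for accurate trajectories; forward Euler is used here only to keep the sketch short.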
</div>



In [4]:
LorentzPlotter().plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

<div class=width>

#### 4. Text Generation

This is a real-life example of a sequence prediction task. Given the beginning of a sentence, the model tries to write the remaining part. We can generate long paragraphs of articles this way. Generating fluent and meaningful text may require a very large model; here we only use a small model for demonstration.

There are two types of models:
- Character: a character-level model, which generates a single character at each step.
- Word: a word-level model, which generates a word at each step.

</div>



In [5]:
TextGenerator().plot() 

VBox(children=(HBox(children=(HTML(value=' <font size="+0.4">Model Type: </font>'), ToggleButtons(options=('Ch…

<div class=width>

### 2. Basic Architectures for seq2seq modelling

Next, let's look at some basic architectures for seq2seq modelling. We will begin with the recurrent neural network (RNN), which is one of the simplest architectures.
</div>

<div class=width>

#### 1. Recurrent neural networks (RNN)

The recurrent neural network is the most basic sequence to sequence model. Its dynamics can be written as
$$\begin{align*}
h_{t+1} &= \sigma(Wh_{t} + Ux_{t} + b)\\
o_{t+1} &= c^\top h_{t+1},
\end{align*}$$
where $h$ is called the hidden state. Note that this architecture is causal: the output $o_{t+1}$ depends only on inputs up to time $t$.
 
<p align="center">
<img src="resources/assets/rnn.png" alt="drawing" width="500" >
</p>

With the structure above, input and output sequences have the same length, which is the typical supervised learning setting. We can also feed the output $o_t$ back as the input $x_{t+1}$, which forms an autoregressive structure that is usually applied to time series prediction or sequence generation.
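
The autoregressive mode can be sketched in a few lines of NumPy. This is only an illustration with untrained, randomly initialised parameters. It assumes the input and output dimensions coincide, so that an output can be fed back as an input, and that the output is read from the freshly updated hidden state. The names `step` and `rollout` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hid_dim = 1, 16   # input and output share `dim` so outputs can be fed back

# Randomly initialised (untrained) parameters, for illustration only
W = rng.normal(scale=0.1, size=(hid_dim, hid_dim))
U = rng.normal(scale=0.1, size=(hid_dim, dim))
b = np.zeros(hid_dim)
c = rng.normal(scale=0.1, size=(dim, hid_dim))

def step(h, x):
    """One recurrence step: returns the new hidden state and its output."""
    h_next = np.tanh(W @ h + U @ x + b)
    return h_next, c @ h_next

def rollout(x0, steps):
    """Autoregressive generation: each output becomes the next input."""
    h, x = np.zeros(hid_dim), x0
    outputs = []
    for _ in range(steps):
        h, o = step(h, x)
        outputs.append(o)
        x = o              # feed the output back as the next input
    return np.stack(outputs)

generated = rollout(np.array([1.0]), steps=20)   # shape (20, 1)
```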

In the following demo implementation, the model takes an input of size **(batch size, input len, input dim)** and produces an output of size **(batch size, input len, output dim)**.

</div>


In [6]:
# Example implementation
import torch
from torch import nn


class RNN(nn.Module):
    def __init__(self,
                 input_dim,
                 output_dim,
                 hid_dim,
                 activation=nn.Tanh()
                ):
        super().__init__()
        self.U = nn.Linear(input_dim, hid_dim)
        self.W = nn.Linear(hid_dim, hid_dim)
        self.c = nn.Linear(hid_dim, output_dim)
        self.activation = activation  # nonlinearity sigma used in the recurrence
        self.hid_dim = hid_dim

    def forward(self, x, initial_hidden=None):
        # x = [batch size, input len, input dim]
        batch_size, length = x.shape[0], x.shape[1]

        hidden = []
        # Initial hidden state, shape [batch size, 1, hid dim]
        if initial_hidden is None:
            hidden.append(torch.zeros(batch_size, 1, self.hid_dim, dtype=x.dtype, device=x.device))
        else:
            hidden.append(initial_hidden)

        # The input projection does not depend on the hidden state, so compute it once
        Ux = self.U(x)

        # Recurrent relation: h_{i+1} = activation(W h_i + U x_i)
        for i in range(length):
            h_next = self.activation(self.W(hidden[i]) + Ux[:, i:i+1, :])
            hidden.append(h_next)

        # Concatenate h_1, ..., h_T into [batch size, input len, hid dim]
        hidden = torch.cat(hidden[1:], dim=1)

        # Output mapping
        out = self.c(hidden)

        return out

<div class=width>

Let's now test the model on the tasks mentioned above and plot the prediction against the output.
You can find the saved model parameters in the folder `resources/saved_models/lorentz`.

</div>


In [7]:
LorentzEvaluation(RNNModel, path='resources/saved_models/lorentz/rnn_1_10_128.ckpt').plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

<div class=width>

A simple RNN model can roughly learn the basic pattern of this relationship; for better performance we may need a larger and better structured model.
Let's also take a look at the training curve.

<p align="center">
<img src="resources/assets/rnn_lorentz_loss.png" alt="drawing" width="600" >
</p>

Note that there is a plateau region where the loss nearly stops decaying. This phenomenon has been analysed in the work
[Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks](https://www.jmlr.org/papers/volume23/21-0368/21-0368.pdf).
The main idea is that RNNs have difficulty learning targets with long memory: the training loss exhibits a plateau, and the length of the plateau region grows exponentially with the memory.

Next, let's look at the shift sequence example. We generate a 32-step sequence and move it to the right by 8 steps. Let's first look at the training loss.
<p align="center">
<img src="resources/assets/rnn_shiftseq_loss.png" alt="drawing" width="600" >
</p>
The plateauing also occurs here. 

Next, let's look at how the model makes its predictions. First click the `Refresh` button to randomly pick some samples.
You may have noticed that the prediction does not follow the output pattern well; instead, the model tries to give an output that is close to zero.
The training data and the loss function are the main reasons for this behaviour. The mean of all the training sequences is the constant zero sequence, and the loss we use is the MSE loss. Thus, when the model cannot properly learn the target, producing a constant zero output keeps the loss small, because the loss is averaged over all the inputs.
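
The argument about the MSE loss can be checked numerically. The sketch below uses hypothetical zero-mean targets drawn uniformly from $[-1, 1]$, not the notebook's actual training data, and shows that among constant predictions the MSE is smallest near zero, where it equals roughly the variance of the targets.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical zero-mean targets: 10000 sequences of length 32
targets = rng.uniform(-1.0, 1.0, size=(10000, 32))

def mse(pred, target):
    """Mean squared error, averaged over all sequences and time steps."""
    return np.mean((pred - target) ** 2)

# MSE of a constant prediction, for several candidate constants
candidates = np.linspace(-1.0, 1.0, 21)
losses = [mse(const, targets) for const in candidates]
best = candidates[int(np.argmin(losses))]

print(best)                 # the best constant is (numerically) zero
print(mse(0.0, targets))    # roughly the variance of the targets
print(targets.var())
```

For a constant prediction $a$, the MSE decomposes as the target variance plus $(a - \text{mean})^2$, which is why the minimum sits at the mean of the targets.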




</div>


In [8]:
ShiftEvaluation(RNNModel, 'resources/saved_models/shift/rnn_8_32.ckpt').plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

<div class=width>

#### 2. Convolutional neural networks (CNN)

CNNs are widely applied to computer vision tasks. However, a 1D convolution can also be applied to sequence tasks, since convolution is naturally a seq2seq mapping. To apply CNNs to sequential data, people often use a dilated convolution structure, as shown in the picture below. By using small filters with exponentially increasing dilation rates, the model obtains a very large effective filter with few parameters. For example, in the following picture the convolution filter achieves a receptive field of size 16 with only 8 parameters. This gives rise to a low-rank structure, which is discussed in
[Approximation Theory of Convolutional Architectures for Time Series Modelling](http://proceedings.mlr.press/v139/jiang21d.html).
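
The parameter count and receptive field quoted above can be reproduced with a small sketch. It assumes four stacked layers with kernel size 2 and dilation rates 1, 2, 4, 8 (consistent with 8 parameters and a receptive field of 16, but not confirmed by the source) and ignores nonlinearities, so that the stack collapses to a single equivalent filter. The helpers `dilate` and `effective_filter` are illustrative names.

```python
import numpy as np

def dilate(kernel, dilation):
    """Insert (dilation - 1) zeros between the taps of a kernel."""
    out = np.zeros((len(kernel) - 1) * dilation + 1)
    out[::dilation] = kernel
    return out

def effective_filter(kernels, dilations):
    """Compose stacked (linear) dilated convolutions into one equivalent filter."""
    total = np.array([1.0])
    for kernel, d in zip(kernels, dilations):
        total = np.convolve(total, dilate(kernel, d))
    return total

dilations = [1, 2, 4, 8]
kernels = [np.array([1.0, 1.0]) for _ in dilations]   # 4 layers x 2 taps

n_params = sum(len(kernel) for kernel in kernels)
filt = effective_filter(kernels, dilations)
print(n_params, len(filt))   # 8 16
```

The 16 taps of the composed filter are products of only 8 free weights, which is the kind of constrained structure the paragraph above refers to.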
 
<p align="center">
<img src="resources/assets/cnn.png" alt="drawing" width="500" >
</p>

In the paper we discuss the difference between CNN and RNN structures on seq2seq modelling problems. The approximation capability of CNNs is better than that of RNNs for targets that are not smooth or have long memory. Experiments also show that CNNs indeed perform better on certain tasks.

Next, let's look at the shift sequence example and the Lorenz system example using convolutional structures.
For the Lorenz system task, the CNN performs much better than the RNN, as it captures the pattern very well. It also has a smoother training curve.

The CNN model has about 40K parameters, while the RNN model has around 791K parameters. Even though the RNN has many more parameters than the CNN, its performance is not necessarily better.


<p align="center">
<img src="resources/assets/tcn_lorentz_loss.png" alt="drawing" width="600" >
</p>


</div>



In [9]:
LorentzEvaluation(TCNModel, path='resources/saved_models/lorentz/tcn_1_10_128.ckpt').plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

<div class=width>

For the shift sequence example, we can hardly tell the difference between the prediction and the output from the plot. Shifting a sequence is itself a convolution, so the CNN is able to represent it exactly. The training curve is shown in the following image.
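
One way to see the exact representation is to build a shift by hand from stacked dilated filters of size 2. The sketch below is an illustration, not the trained `TCNModel`: it assumes linear layers with dilation rates $2^i$, and picks each kernel from the binary digits of the shift $k$, so that a layer either passes the signal through or delays it by its dilation.

```python
import numpy as np

def dilate(kernel, dilation):
    """Insert (dilation - 1) zeros between the taps of a kernel."""
    out = np.zeros((len(kernel) - 1) * dilation + 1)
    out[::dilation] = kernel
    return out

def shift_kernels(k, n_layers):
    """Pick one size-2 kernel per layer from the binary digits of k."""
    kernels = []
    for i in range(n_layers):
        bit = (k >> i) & 1
        # [1, 0] passes the signal through; [0, 1] delays it by the dilation 2**i
        kernels.append(np.array([0.0, 1.0]) if bit else np.array([1.0, 0.0]))
    return kernels

k, n_layers, length = 8, 5, 32
total = np.array([1.0])
for i, kernel in enumerate(shift_kernels(k, n_layers)):
    total = np.convolve(total, dilate(kernel, 2 ** i))

x = np.random.default_rng(0).normal(size=length)
y = np.convolve(x, total)[:length]                 # output of the stacked layers
expected = np.concatenate([np.zeros(k), x[:length - k]])
print(np.allclose(y, expected))                    # True
```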


<p align="center">
<img src="resources/assets/tcn_shift_loss.png" alt="drawing" width="600" >
</p>

</div>


In [10]:
ShiftEvaluation(TCNModel, 'resources/saved_models/shift/tcn_32_128.ckpt').plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

In [3]:
ShiftEvaluation(TransformerModel, 'resources/saved_models/shift/transformer_32_128.ckpt').plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…

In [2]:
LorentzEvaluation(TransformerModel, 'resources/saved_models/lorentz/transformer_1_10_128.ckpt').plot()

VBox(children=(HBox(children=(HBox(children=(HTML(value='&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'), Button(description=…