## Sequence to Sequence modelling notebook

In [1]:
from initialize import *
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [2]:
ConvoPlotter().plot()


### Sequence to sequence modelling tasks

Sequence to sequence modelling tasks are machine learning tasks where both the input and the output are sequences.

Some examples:
<p align="left">
<img src="assets/seq_to_seq_applications.png" alt="drawing" width="800" >
</p>


### Examples of sequence to sequence (seq2seq) tasks
Let's look at some concrete seq2seq tasks for illustration. 

#### 1. Shift a sequence
This is a toy example where we shift a sequence to the right and pad with zeros on the left. For example, to shift an input by three steps,
$$\begin{align*}
\text{Input:} & \, 5,8,9,0,1,2,5,6 \\
\text{Output:} &\,  0,0,0,5,8,9,0,1
\end{align*}$$
Shifting the input by $k$ units is a linear relation, equivalent to convolving the input with a delta impulse,
$$\begin{align*}
y(t) = \sum_s \delta(s-k)x(t-s)
\end{align*}$$
where 
$$
\delta(s-k) = \begin{cases}
                    1 & s = k \\
                    0 & \text{else}
                \end{cases}.
$$
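
The convolution formula can be checked numerically on the example sequence. The following is a minimal sketch using NumPy; the helper name `shift_by_convolution` is hypothetical and not part of the notebook's code.

```python
import numpy as np

def shift_by_convolution(x, k):
    """Shift x to the right by k steps by convolving with a delta impulse at s = k."""
    x = np.asarray(x, dtype=float)
    delta = np.zeros(k + 1)
    delta[k] = 1.0                      # delta(s - k): 1 at s = k, 0 elsewhere
    # Full convolution has length len(x) + k; keep the first len(x) entries.
    return np.convolve(x, delta)[:len(x)]

x = [5, 8, 9, 0, 1, 2, 5, 6]
y = shift_by_convolution(x, 3)
print(y)  # the values 0, 0, 0, 5, 8, 9, 0, 1 from the example
```

Truncating the full convolution to the input length is what drops the last $k$ input values, as in the example.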

In theory, RNNs do not perform well on this task, while CNNs perform very well. (See our paper [Approximation Theory of Convolutional Architectures for Time Series Modelling](https://proceedings.mlr.press/v139/jiang21d.html))

In [2]:
ShiftPlotter().plot()


In [4]:
LorentzPlotter().plot()


### Recurrent neural networks (RNN)

The recurrent neural network is the most basic sequence to sequence model. Its dynamics can be written as 
$$\begin{align*}
h_{t} &= \sigma(Wh_{t-1} + Ux_{t} + b)\\
o_{t} &= c^\top h_t.
\end{align*}$$ 
where $h_t$ is called the hidden state. Note that this architecture is causal: the output $o_t$ at time $t$ depends only on inputs up to time $t$. 
 
<p align="center">
<img src="assets/rnn.png" alt="drawing" width="500" >
</p>

Based on the structure above, the input and output sequences can have the same length, which is the typical supervised learning setting. We can also feed the output $o_t$ back as the input $x_{t+1}$, which forms an autoregressive structure that is usually applied to time series prediction or sequence generation. 
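
To make the autoregressive mode concrete, here is a minimal sketch of a rollout in which each output is fed back as the next input. It uses PyTorch's `nn.RNNCell` with a linear readout; the hidden size, the untrained weights, and the helper name `rollout` are assumptions for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)
HID_DIM = 16                                         # illustrative hidden size
cell = nn.RNNCell(input_size=1, hidden_size=HID_DIM)  # recurrence (tanh by default)
readout = nn.Linear(HID_DIM, 1)                       # hidden state -> output

def rollout(x0, steps):
    """Generate `steps` outputs, feeding each output back in as the next input."""
    h = torch.zeros(x0.shape[0], HID_DIM)
    x, outputs = x0, []
    for _ in range(steps):
        h = cell(x, h)       # update the hidden state from the current input
        x = readout(h)       # the output becomes the next input
        outputs.append(x)
    return torch.stack(outputs, dim=1)   # [batch size, steps, 1]

with torch.no_grad():
    generated = rollout(torch.zeros(2, 1), steps=5)
print(generated.shape)  # torch.Size([2, 5, 1])
```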

In the following implementation, the model takes an input of size **(batch size, input len, input dim)** and produces an output of size **(batch size, input len, output dim)**. 
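
The notebook's own `RNN` class is imported from `initialize` and is not shown here. As a rough illustration of the shape convention, the following is a minimal sketch of a causal RNN written directly from the recurrence; the class name `SimpleRNN`, the `tanh` activation, and the zero initial hidden state are assumptions, not the notebook's implementation.

```python
import torch
from torch import nn

class SimpleRNN(nn.Module):
    """Hypothetical stand-in for the notebook's RNN class (not the original)."""
    def __init__(self, input_dim, output_dim, hid_dim):
        super().__init__()
        self.hid_dim = hid_dim
        self.U = nn.Linear(input_dim, hid_dim)               # input term plus bias b
        self.W = nn.Linear(hid_dim, hid_dim, bias=False)     # recurrent term
        self.c = nn.Linear(hid_dim, output_dim, bias=False)  # linear readout

    def forward(self, x):
        # x: [batch size, input len, input dim]
        batch_size, length, _ = x.shape
        h = torch.zeros(batch_size, self.hid_dim, dtype=x.dtype, device=x.device)
        outputs = []
        for t in range(length):
            # New hidden state from the previous hidden state and the current input.
            h = torch.tanh(self.W(h) + self.U(x[:, t, :]))
            outputs.append(self.c(h))
        return torch.stack(outputs, dim=1)  # [batch size, input len, output dim]

torch.manual_seed(0)
model = SimpleRNN(input_dim=1, output_dim=1, hid_dim=8)
x = torch.randn(4, 10, 1)
y = model(x)
print(y.shape)  # torch.Size([4, 10, 1])
```

Because each output is computed only from the hidden state accumulated so far, changing later inputs leaves earlier outputs unchanged.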

### Demo experiments

Next we show some concrete examples on sequence to sequence tasks.

In [3]:
class DCN(Module):
    def __init__(self,
                 input_dim,
                 output_dim,
                 hid_dim,
                 kernel_size,
                 num_layers,
                 activation='linear'
                ):
        super().__init__()
        # Layer i uses dilation kernel_size**i, padded by (kernel_size-1)*dilation.
        self.conv_layers = nn.ModuleList([
            nn.Conv1d(hid_dim, hid_dim, kernel_size,
                      padding=(kernel_size-1)*(kernel_size**i),
                      dilation=kernel_size**i,
                      bias=False)
            for i in range(num_layers)
        ])
        self.input_ff = nn.Linear(input_dim, hid_dim)
        self.output_ff = nn.Linear(hid_dim, output_dim)
        if activation == 'linear':
            self.activation = nn.Identity()
        elif activation == 'tanh':
            self.activation = nn.Tanh()
        else:
            raise ValueError(f"Unknown activation type: {activation}")

    def forward(self, x):
        # x = [batch size, input len, input dim]
        length = x.shape[1]
        x = self.input_ff(x)

        x = x.permute(0,2,1)

        for layer in self.conv_layers:
            x = x + layer(x)[:,:,:length]
       
        x = x.permute(0,2,1)

        y = self.activation(self.output_ff(x))
        return y
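
Two properties of the dilated convolution stack in `DCN` can be checked with a short, self-contained sketch: the receptive field of the stack, and the fact that padding on both sides followed by keeping only the first `length` outputs makes each layer causal. The specific values below (`kernel_size=2`, `num_layers=5`, `length=32`, layer index 2) are chosen for illustration.

```python
import torch
from torch import nn

kernel_size, num_layers, length = 2, 5, 32

# Each layer with dilation d extends the receptive field by (kernel_size - 1) * d.
receptive_field = 1 + sum((kernel_size - 1) * kernel_size**i for i in range(num_layers))
print(receptive_field)  # 32

# Causality check for a single layer built like the ones in DCN.
torch.manual_seed(0)
dilation = kernel_size**2
conv = nn.Conv1d(1, 1, kernel_size,
                 padding=(kernel_size - 1) * dilation,
                 dilation=dilation, bias=False).double()

x = torch.randn(1, 1, length, dtype=torch.double)
x_future = x.clone()
x_future[:, :, 20:] += 1.0            # perturb only time steps >= 20

with torch.no_grad():
    y = conv(x)[:, :, :length]        # same truncation as in DCN.forward
    y_future = conv(x_future)[:, :, :length]

# Outputs before step 20 are unaffected by the later perturbation.
causal = torch.allclose(y[:, :, :20], y_future[:, :, :20])
print(causal)  # True
```

The residual connection `x + layer(x)` adds the current time step only, so it does not change either property.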

### RNN encoder-decoder

Here we introduce the encoder-decoder structure, which first encodes the input into a context vector and then decodes the context vector into the output. If both the encoder and the decoder are RNNs, this relation can be written as 
$$
\begin{align*}
h_s &= \sigma_E( W_E h_{s-1} + U_E x_s + b_E),
    \hspace{5mm} v = h_\tau,\\
g_t &= \sigma_D( W_D g_{t-1} + b_D),
    \hspace{16mm} g_0 = v, \\
o_t &= W_O g_t + b_O.
\end{align*}
$$

<p align="center">
<img src="assets/encdec.png" alt="drawing" width="500" >
</p>
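
The three equations can be transcribed almost line by line. The following is a minimal sketch under stated assumptions: `tanh` for both $\sigma_E$ and $\sigma_D$, a zero initial encoder state, and the hypothetical class name `TinyEncDec`. It is independent of the notebook's `RNN`-based implementation.

```python
import torch
from torch import nn

class TinyEncDec(nn.Module):
    """Hypothetical literal transcription of the encoder-decoder equations."""
    def __init__(self, input_dim, output_dim, hid_dim, output_len):
        super().__init__()
        self.hid_dim, self.output_len = hid_dim, output_len
        self.W_E = nn.Linear(hid_dim, hid_dim)                # W_E h + b_E
        self.U_E = nn.Linear(input_dim, hid_dim, bias=False)  # U_E x
        self.W_D = nn.Linear(hid_dim, hid_dim)                # W_D g + b_D
        self.W_O = nn.Linear(hid_dim, output_dim)             # W_O g + b_O

    def forward(self, x):
        # Encoder: run over the whole input and keep only the final hidden state.
        h = x.new_zeros(x.shape[0], self.hid_dim)
        for s in range(x.shape[1]):
            h = torch.tanh(self.W_E(h) + self.U_E(x[:, s, :]))
        # Decoder: starts from the context vector and receives no further input.
        g, outputs = h, []
        for _ in range(self.output_len):
            g = torch.tanh(self.W_D(g))
            outputs.append(self.W_O(g))
        return torch.stack(outputs, dim=1)  # [batch size, output len, output dim]

torch.manual_seed(0)
enc_dec = TinyEncDec(input_dim=2, output_dim=1, hid_dim=8, output_len=7)
out = enc_dec(torch.randn(3, 10, 2))
print(out.shape)  # torch.Size([3, 7, 1])
```

Since the decoder sees the input only through the context vector, the output length is set by `output_len` and does not depend on the input length.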


In [4]:
class EncDec(Module):
    def __init__(self,
                 input_dim,
                 output_dim,
                 hid_dim,
                 output_len,
                 activation='linear'
                ):

        super().__init__()
        self.encoder = RNN(input_dim,output_dim, hid_dim, activation, return_hidden=True)
        self.decoder = RNN(input_dim,output_dim, hid_dim, activation)
        self.out_len = output_len
    def forward(self, x):
        _, context = self.encoder(x)
        context = context[:,-2:-1,:]
        batch_size = x.shape[0]
        decoder_input_pad = torch.zeros(batch_size,self.out_len,x.shape[-1], dtype=x.dtype, device=x.device)

        y = self.decoder(decoder_input_pad, context)

        return y

In [5]:
data_name = 'Shift'

train_size = 3000
test_size = 500

train_dataset = Dataset(*Shift({'path_len':32,'shift': 20}).generate(data_num=train_size), dtype=DTYPE, device=device)
test_dataset = Dataset(*Shift({'path_len':32,'shift': 20}).generate(data_num=test_size), dtype=DTYPE, device=device)

train_data = torch.utils.data.DataLoader(train_dataset, batch_size=128,drop_last=True)
test_data = torch.utils.data.DataLoader(test_dataset, batch_size=128,drop_last=True)
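
The `Shift` and `Dataset` helpers come from `initialize` and are not shown here. For readers without that module, the following is a hypothetical stand-in that builds shift data with the same `path_len` and `shift` values. The function name `make_shift_data` and the uniform random inputs are assumptions; the notebook's generator may sample differently.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_shift_data(data_num, path_len=32, shift=20, seed=0):
    """Hypothetical generator: y is x shifted right by `shift` steps, zero-padded on the left."""
    gen = torch.Generator().manual_seed(seed)
    x = torch.rand(data_num, path_len, 1, generator=gen, dtype=torch.double)
    y = torch.zeros_like(x)
    y[:, shift:, :] = x[:, :path_len - shift, :]
    return x, y

x_train, y_train = make_shift_data(3000)
loader = DataLoader(TensorDataset(x_train, y_train), batch_size=128, drop_last=True)
print(len(loader))  # 23 full batches; the incomplete last batch is dropped
```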

In [10]:
experiment_name = f'{data_name}_rnn'
rnn = RNN(input_dim=1, output_dim=1, hid_dim=32).double().to(device)
rnn.load_state_dict(torch.load(f"saved_model/{experiment_name}/best_valid.pt"))
rnn.count_parameters()
# train_model(name=experiment_name,model=rnn,train_data=train_data, test_data=test_data)
print(f'Best Valid Loss: {np.mean(inference(rnn, test_data)):.2e}')

The model has 1,153 trainable parameters
Best Valid Loss: 1.95e-05


In [11]:
experiment_name = f'{data_name}_cnn'
cnn = DCN(input_dim=1, output_dim=1, hid_dim=1, kernel_size=2, num_layers=5).double().to(device)
cnn.load_state_dict(torch.load(f"saved_model/{experiment_name}/best_valid.pt"))
cnn.count_parameters()
# train_model(name=experiment_name,model=cnn,train_data=train_data, test_data=test_data)
print(f'Best Valid Loss: {np.mean(inference(cnn, test_data)):.2e}')

The model has 14 trainable parameters
Best Valid Loss: 2.28e-32
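
The reported parameter counts can be reproduced by hand. The sketch below assumes that `count_parameters` sums the sizes of all trainable tensors and that the `RNN` from `initialize` has the same parameter layout as `nn.RNN` followed by a linear readout. Both are assumptions, although the totals agree with the printed values.

```python
from torch import nn

def count_parameters(model):
    """Hypothetical standalone version: total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# DCN(hid_dim=1, kernel_size=2, num_layers=5): five bias-free convolutions
# with 2 weights each, plus input and output Linear(1, 1) layers (2 each).
dcn_like = nn.ModuleList(
    [nn.Conv1d(1, 1, 2, bias=False) for _ in range(5)]
    + [nn.Linear(1, 1), nn.Linear(1, 1)]
)
print(count_parameters(dcn_like))  # 14

# Assumed RNN layout: 32*1 + 32*32 + 32 + 32 recurrent parameters, plus a 32 -> 1 readout.
rnn_like = nn.ModuleList([nn.RNN(1, 32, batch_first=True), nn.Linear(32, 1)])
print(count_parameters(rnn_like))  # 1153
```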


In [15]:
experiment_name = f'{data_name}_encdec'
encdec = EncDec(1, 1, hid_dim=64, output_len=32).double().to(device)
# encdec.load_state_dict(torch.load(f"saved_model/{experiment_name}/best_valid.pt"))
train_model(name=experiment_name,model=encdec,train_data=train_data, test_data=test_data)
print(f'Best Valid Loss: {np.mean(inference(encdec, test_data)):.2e}')

Train Loss: 3.135e-02 	 Val. Loss: 3.129e-02 	 Best Loss: 3.135e-02 	 Current lr: 1.000e-08: | 1000/1000 [06:30<00:00,  2.56it/s]

Best Valid Loss: 3.13e-02





In [None]:
dataset = Dataset(X, y, dtype=DTYPE, device=device)

train_data = torch.utils.data.DataLoader(train_dataset, batch_size=128,drop_last=True)
test_data = torch.utils.data.DataLoader(test_dataset, batch_size=128,drop_last=True)