I became curious about sequence-to-sequence models after reading this article [here](https://weiminwang.blog/2017/09/29/multivariate-time-series-forecast-using-seq2seq-in-tensorflow/), so I examined the source code in Python and TensorFlow to explore the methodology.

A quick summary:
- **Weimin**'s article constructs the model in TensorFlow and uses the output of the *decoder* LSTM cell as the input for the next step.
- The **Machine Learning Mastery** model only outputs 1 word; that word is then appended to the input and fed into the model again. This is done outside of Keras and therefore is not part of the training process.
- **Keras' official example** uses TimeDistributed and RepeatVector layers to achieve this.

## Weimin's example

In [None]:
'''
https://github.com/aaxwaz/Multivariate-Time-Series-forecast-using-seq2seq-in-TensorFlow/blob/master/build_model_basic.py
'''
with variable_scope.variable_scope(scope or "rnn_decoder"):
    state = initial_state
    outputs = []
    prev = None
            
    for i, inp in enumerate(decoder_inputs):
        # decoder_inputs is a list with one tensor per time step,
        # so i is the time-step index and inp is the input at that step
        if loop_function is not None and prev is not None:
            # this condition does not apply for the first one
                  
            with variable_scope.variable_scope("loop_function", reuse=True):
                inp = loop_function(prev, i)
        if i > 0:
            variable_scope.get_variable_scope().reuse_variables()
        
        output, state = cell(inp, state)
        outputs.append(output)
        if loop_function is not None:
            prev = output
            
    return outputs, state

The snippet below handles the case where the first input has already been processed (prev is **not None**): the input is replaced by the result of the loop function. The docstring of the function, reproduced after the snippet, says the same thing.

In [None]:
if loop_function is not None and prev is not None:
    with variable_scope.variable_scope("loop_function", reuse=True):
        inp = loop_function(prev, i)

In [None]:
'''
loop_function: If not None, 
                    this function will be applied to the i-th output
                                in order to generate the i+1-st input, 
                    and decoder_inputs will be **ignored**,
                                except for the first element ("GO" symbol). 
            This can be used for decoding,
              but also for training to emulate http://arxiv.org/abs/1506.03099
'''

Notice that *prev* is the previous output, as we can see in the code. Basically, we feed the previous output back into our model to generate more output.
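
The feedback loop can be sketched without TensorFlow. This is a minimal sketch, not Weimin's code: `toy_cell` and `decode` are hypothetical names, and the cell is a scalar recurrence chosen only so that the example runs on plain floats.

```python
import math

def toy_cell(inp, state):
    """Hypothetical stand-in for an RNN cell: returns (output, new_state)."""
    new_state = math.tanh(0.5 * inp + 0.5 * state)
    return new_state, new_state

def decode(decoder_inputs, initial_state, cell, loop_function=None):
    """Mirror the control flow of the decoder loop above on plain floats."""
    state = initial_state
    outputs = []
    prev = None
    for i, inp in enumerate(decoder_inputs):
        # From the second step on, replace the given input by a function
        # of the previous output (only when a loop_function is supplied).
        if loop_function is not None and prev is not None:
            inp = loop_function(prev, i)
        output, state = cell(inp, state)
        outputs.append(output)
        if loop_function is not None:
            prev = output
    return outputs, state

# Without loop_function every element of decoder_inputs is consumed.
fed_inputs, _ = decode([1.0, 1.0, 1.0], 0.0, toy_cell)

# With loop_function only decoder_inputs[0] matters; the rest are ignored,
# so two input lists that share the first element give the same outputs.
fed_back, _ = decode([1.0, 1.0, 1.0], 0.0, toy_cell,
                     loop_function=lambda prev, i: prev)
ignored, _ = decode([1.0, -7.0, 99.0], 0.0, toy_cell,
                    loop_function=lambda prev, i: prev)
```

In this sketch `fed_back` and `ignored` are identical, which is the behaviour the docstring describes: everything after the first decoder input is discarded once a loop function is given.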

This is different from a normal recurrent neural network, as you can see in the implementation below ([source](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/)): we feed an input sequence X into the RNN one slice at a time, and the output at each step is never used as the input of the next step.

In [None]:
'''
Source: https://github.com/dennybritz/rnn-tutorial-rnnlm/blob/master/RNNLM.ipynb
'''
def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]
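
The comment about indexing `U` by `x[t]` can be checked on a small case. The sketch below uses plain Python lists instead of NumPy; the matrix values and the helper `matvec` are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

# Toy weight matrix with hidden_dim = 2 and word_dim = 3 (arbitrary values).
U = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]

word_index = 2
one_hot = [1.0 if j == word_index else 0.0 for j in range(3)]

column = [row[word_index] for row in U]   # same as U[:, word_index] in NumPy
product = matvec(U, one_hot)              # U multiplied by the one-hot vector
```

Both `column` and `product` hold the third column of `U`, so the lookup is just a cheaper way of doing the multiplication.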

## Machine learning mastery
To generate outputs in sequence, we can append each output to the input and run the model again, as done on [Machine Learning Mastery](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/):

In [None]:
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
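        # predict the index of the most likely next word (1 more word)
        # Note: predict_classes exists only in older Keras releases; in newer ones,
        # take the argmax of model.predict(encoded) over the last axis instead
        yhat = model.predict_classes(encoded, verbose=0)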
        
        # map the predicted word index to the one corresponding word
        # Note: just 1 word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        
        # append this word to the input text
        # if the input text is too long, we truncate it from the left.
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
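
The append-and-truncate pattern does not depend on Keras. Here is a minimal sketch in which `generate` and `toy_model` are hypothetical: the "model" is just a function of the last `seq_length` tokens, standing in for the tokenizer, padding and prediction steps above.

```python
def generate(predict_next, seed_tokens, seq_length, n_words):
    """Greedy generation: predict one token, append it, slide the window."""
    context = list(seed_tokens)
    result = []
    for _ in range(n_words):
        window = context[-seq_length:]    # keep only the last seq_length tokens
        next_token = predict_next(window)
        context.append(next_token)        # the output becomes part of the input
        result.append(next_token)
    return result

# Hypothetical stand-in model: the next token is the window sum modulo 10.
toy_model = lambda window: sum(window) % 10

generated = generate(toy_model, seed_tokens=[1, 2], seq_length=2, n_words=4)
# generated == [3, 5, 8, 3]
```

Nothing in this loop is seen during training: the model is trained to predict one token, and the sequence only emerges from calling it repeatedly.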

## Keras
A simple way to build sequence generation into the model itself is available in Keras, as seen on Keras' website [here](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). Here I reproduce the code for the trivial case [here](https://github.com/keras-team/keras/blob/master/examples/addition_rnn.py) with some modifications for clarity.

In [16]:
from keras import layers
from keras.layers import LSTM, GRU, SimpleRNN
from keras.models import Sequential

RNN = SimpleRNN

DIGITS = 3
MAXLEN = DIGITS + 1 + DIGITS

BATCH_SIZE = 128
LAYERS = 1
chars = '0123456789+ '
print("Length of vocabularty is ",len(chars))

def GenerateModel(RNN):
    model = Sequential()
    
    # Encoder
    model.add(RNN(30, input_shape=(MAXLEN, len(chars))))
    model.add(layers.RepeatVector(DIGITS + 1))
    
    # Decoder: assume only 1 hidden layer for simplicity
    model.add(RNN(40, return_sequences = True))
        
    model.add(layers.TimeDistributed(layers.Dense(len(chars))))
    
    model.add(layers.Activation('softmax'))
    
    # each output is 1 out of 12 characters, so we use categorical crossentropy
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
    
    return model

model = GenerateModel(RNN)
model.summary()
print("Input shape is", model.input_shape)

Length of vocabularty is  12
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_22 (SimpleRNN)    (None, 30)                1290      
_________________________________________________________________
repeat_vector_14 (RepeatVect (None, 4, 30)             0         
_________________________________________________________________
simple_rnn_23 (SimpleRNN)    (None, 4, 40)             2840      
_________________________________________________________________
time_distributed_14 (TimeDis (None, 4, 12)             492       
_________________________________________________________________
activation_14 (Activation)   (None, 4, 12)             0         
Total params: 4,622
Trainable params: 4,622
Non-trainable params: 0
_________________________________________________________________
Input shape is (None, 7, 12)


A few basic things to note:
- The vocabulary has 12 characters, so each input (of the batch) is a one-hot matrix of size 7 * 12 (maximum 7 characters).
- The first layer, the encoder layer, has 1290 parameters, implying 1290/30 = 43 parameters per RNN cell: 12 input weights, 30 recurrent weights and 1 bias.
- The second-to-last layer holds the outputs of the previous layers. Its output is of size 4 * 12 (for 4 characters). After the softmax activation, each of the 4 rows is a probability distribution over the 12 characters, so the model outputs 4 characters.

The other interesting thing is:
- *RepeatVector*, which has no parameters. This layer repeats its input **DIGITS+1** times, as this is the maximum length of the output. Its input size is 30, which corresponds to the 30 RNN cells of the previous layer. The choice of 30 is arbitrary. Hence the output shape is 4*30.
- The layer after RepeatVector has *return_sequences=True* and will return all the outputs instead of only the last one. It outputs 4 times, once for each copy produced by RepeatVector. Note: this matches the Python RNN code above, where each input step produces exactly 1 output. <br> The output of this layer is 4*40, with 40 as an arbitrary value for the number of cells. Its 2840 parameters are 40x(30 + 40 + 1).
- The TimeDistributed layer wraps around a Dense layer of 12 units and handles each time slice of its input separately. <br> Its input is 4x40. <br> Its output is 4x12, and its number of parameters is 492 = 12x(40 + 1) (+1 is for the bias), so we know that it **shares** the parameters among the time slices of the input.
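
What RepeatVector and TimeDistributed do to the shapes can be sketched in plain Python for a single sample. The function names, the 3-feature vector and the weights below are made up for illustration; this is not the Keras implementation.

```python
def repeat_vector(vector, n):
    """Like RepeatVector(n) for one sample: (features,) -> (n, features)."""
    return [list(vector) for _ in range(n)]

def dense(x, weights, bias):
    """Dense layer without activation; weights has shape (inputs, units)."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]

def time_distributed_dense(sequence, weights, bias):
    """Apply the same dense layer (shared weights) to every time slice."""
    return [dense(step, weights, bias) for step in sequence]

encoded = [0.5, -1.0, 2.0]              # toy encoder output with 3 features
repeated = repeat_vector(encoded, 4)    # shape (4, 3), no parameters involved

weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # (3 inputs, 2 units)
bias = [0.0, 0.5]
projected = time_distributed_dense(repeated, weights, bias)   # shape (4, 2)

# The parameter count does not depend on the 4 time steps: 2 * (3 + 1) = 8.
n_params = len(weights) * len(weights[0]) + len(bias)
```

Because the four time slices are identical copies here, the four projected rows are identical too; in the real model the decoder RNN in between makes each slice different.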

What happens if we set the number of hidden cells to 1?

In [18]:
def GenerateModel_Simple(RNN):
    model = Sequential()
    
    # Encoder
    model.add(RNN(1, input_shape=(MAXLEN, len(chars))))
    model.add(layers.RepeatVector(DIGITS + 1))
    
    # Decoder: assume only 1 hidden layer for simplicity
    model.add(RNN(1, return_sequences = True))
        
    model.add(layers.TimeDistributed(layers.Dense(len(chars))))
    
    model.add(layers.Activation('softmax'))
    
    # each output is 1 out of 12 characters, so we use categorical crossentropy
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
    
    return model

GenerateModel_Simple(RNN).summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
simple_rnn_24 (SimpleRNN)    (None, 1)                 14        
_________________________________________________________________
repeat_vector_15 (RepeatVect (None, 4, 1)              0         
_________________________________________________________________
simple_rnn_25 (SimpleRNN)    (None, 4, 1)              3         
_________________________________________________________________
time_distributed_15 (TimeDis (None, 4, 12)             24        
_________________________________________________________________
activation_15 (Activation)   (None, 4, 12)             0         
Total params: 41
Trainable params: 41
Non-trainable params: 0
_________________________________________________________________


So for simple_rnn_24: number of parameters = 14 = 12 (# inputs) + 1 (recurrent weight) + 1 (bias) <br>
For simple_rnn_25: number of parameters = 3 = 1 (# inputs) + 1 (recurrent weight) + 1 (bias) <br>
Time distributed layer: 24 = 12 * (1 (# of inputs) + 1 (bias))
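
The counts in both summaries follow from two small formulas. The helper names below are my own; the formulas are the usual ones for a SimpleRNN layer (input weights, recurrent weights, bias) and a Dense layer (weights, bias).

```python
def simple_rnn_params(input_dim, units):
    """Each unit has input_dim input weights, units recurrent weights, 1 bias."""
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    """Each unit has input_dim weights and 1 bias."""
    return units * (input_dim + 1)

# First model: 12 characters -> RNN(30) -> RNN(40) -> Dense(12)
big = [simple_rnn_params(12, 30), simple_rnn_params(30, 40), dense_params(40, 12)]
# big == [1290, 2840, 492], sum(big) == 4622

# Second model: 12 characters -> RNN(1) -> RNN(1) -> Dense(12)
small = [simple_rnn_params(12, 1), simple_rnn_params(1, 1), dense_params(1, 12)]
# small == [14, 3, 24], sum(small) == 41
```

The totals agree with the "Total params" lines printed by `model.summary()` above.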

## More advanced Keras
This is also one of Keras' official examples, described [here](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html). Like Weimin's example, this model features an encoder layer, a decoder layer and a layer to generate target data. The data is organized as follows (from the website):
- **encoder_input_data** is a 3D array of shape (num_pairs, max_english_sentence_length, num_english_characters)
- **decoder_input_data** is a 3D array of shape (num_pairs, max_french_sentence_length, num_french_characters)
- **decoder_target_data** is the same as decoder_input_data but offset by one timestep. decoder_target_data[:, t, :] will be the same as decoder_input_data[:, t + 1, :].

The model uses the teacher forcing method. According to the comment in the script, with *teacher forcing* the target sequence is the decoder input offset by one time step into the future, and the decoder's initial states are the state vectors from the encoder. Effectively, the decoder learns to generate targets[**t+1**...] given targets[...**t**], conditioned on the input sequence.
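
The one-step offset is easy to see on a single string. This is a simplified sketch on characters rather than one-hot arrays: the example word is made up, `"\t"` is the start character used in the script, and `"\n"` as the end marker is my recollection of the Keras script, so treat it as an assumption.

```python
# Target sentence wrapped in a start ("\t") and an end ("\n") marker.
target_text = "\t" + "chat" + "\n"

# The decoder input is the target without its last character;
# the decoder target is the same text shifted one step to the left.
decoder_input = list(target_text[:-1])    # ['\t', 'c', 'h', 'a', 't']
decoder_target = list(target_text[1:])    # ['c', 'h', 'a', 't', '\n']

# At every step t the decoder sees decoder_input[t]
# and is trained to predict decoder_target[t].
pairs = list(zip(decoder_input, decoder_target))
```

The real script stores both sequences as zero-padded one-hot arrays, but the alignment between input and target is the same.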

At inference time, we populate the first character of the target sequence with the start character:

In [20]:
target_seq[0, 0, target_token_index['\t']] = 1

The training model illustrates 3 features of Keras' RNN:
- The **return_state** constructor argument, configuring an RNN layer to return a list where the first entry is the outputs and the next entries are the internal RNN states. This is used to recover the states of the encoder.
- The **initial_state** call argument, specifying the initial state(s) of an RNN. This is used to pass the encoder states to the decoder as **initial states**.
- The **return_sequences** constructor argument, configuring an RNN to return its full sequence of outputs (instead of just the last output, which is the default behavior). This is used in the decoder.

In [None]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)

# An LSTM with return_state=True returns its output (here only the final output,
# since return_sequences is not set) followed by state_h and state_c
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [None]:
# Set up the decoder, using `encoder_states` as initial state.
# We set up our decoder to return full output sequences,
# and to return internal states as well. 
# We don't use the return states in the training model
# but we will use them in inference.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)

decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=encoder_states)

In [None]:
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

This model is then fitted to the training data. Inference takes some extra setup:
- Create an encoder model from the variables **encoder_inputs** and **encoder_states**
- Create a decoder model that takes decoder_inputs + decoder_states_inputs and outputs decoder_outputs + decoder_states

In [None]:
# All the weights have been trained previously
encoder_model = Model(encoder_inputs, encoder_states)

...
# decoder_lstm has been trained from before
# it is set up to output the state and output a sequence
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

In [None]:
# The inference process looks like this
states_value = encoder_model.predict(input_seq)

target_seq = np.zeros((1, 1, num_decoder_tokens))
target_seq[0, 0, target_token_index['\t']] = 1.

# One iteration of the decoding loop.
# After each iteration the target sequence is replaced by the sampled character
# and, in the Keras script, the states are replaced by the returned h and c
output_tokens, h, c = decoder_model.predict(
    [target_seq] + states_value)

sampled_token_index = np.argmax(output_tokens[0, -1, :])
sampled_char = reverse_target_char_index[sampled_token_index]
decoded_sentence += sampled_char
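
The shape of the full inference loop can be sketched with a stand-in for `decoder_model.predict`. Everything below is hypothetical (the vocabulary, `toy_decoder_step`, `greedy_decode`); it only illustrates that the sampled token and the state are both carried into the next step, and that decoding stops at the end marker or at a length limit.

```python
VOCAB = ["\t", "a", "b", "c", "\n"]   # "\t" = start marker, "\n" = end marker

def toy_decoder_step(token_index, state):
    """Hypothetical stand-in for decoder_model.predict.

    Returns (scores over VOCAB, new_state).
    """
    next_index = min(token_index + state, len(VOCAB) - 1)
    scores = [1.0 if j == next_index else 0.0 for j in range(len(VOCAB))]
    return scores, state + 1

def greedy_decode(step, initial_state, max_length=10):
    token_index = VOCAB.index("\t")    # start from the start marker
    state = initial_state              # would come from the encoder
    decoded = ""
    while len(decoded) < max_length:
        scores, state = step(token_index, state)   # the state is updated too
        token_index = max(range(len(scores)), key=scores.__getitem__)  # argmax
        char = VOCAB[token_index]
        if char == "\n":               # stop at the end marker
            break
        decoded += char
    return decoded

decoded_sentence = greedy_decode(toy_decoder_step, initial_state=1)
```

With this stand-in the loop emits "a", then "c", then the end marker, so `decoded_sentence` is `"ac"`; a step function that never emits the end marker would be cut off at `max_length`.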