# What is sequence-to-sequence learning?

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g. sentences in English) to sequences in another domain (e.g. the same sentences translated to French).

![ex](https://imgur.com/XwREty3.jpg)

      "the cat sat on the mat" -> [Seq2Seq model] -> "那隻貓坐在地毯上" 
      
This can be used for machine translation or for free-form question answering (generating a natural-language answer given a natural-language question). More generally, it applies any time you need to generate text.

There are multiple ways to handle this task, either using RNNs or using 1D convnets. Here we will focus on RNNs.

## The trivial case: when input and output sequences have the same length

When input sequences and output sequences have the same length, you can implement such a model simply with a Keras LSTM or GRU layer (or a stack of them). The example script below does this: it teaches an RNN to add numbers that are encoded as character strings:

![LSTM](https://blog.keras.io/img/seq2seq/addition-rnn.png)


In [1]:
from keras.models import Sequential
from keras import layers
from keras.utils import plot_model
import numpy as np
from six.moves import range
from IPython.display import Image

Using TensorFlow backend.


In [2]:
class CharacterTable(object):
    """
    Given a set of characters:
    + Encode a string of these characters as one-hot vectors
    + Decode one-hot vectors back to the original characters
    + Decode a vector of probabilities to the most likely characters
    """
    def __init__(self, chars):
        """
        Initialize the character table.

        # Parameters:
            chars: Characters that can appear in the input.
        """
        self.chars = sorted(set(chars))
        self.char_indices = dict((c, i) for i, c in enumerate(self.chars))
        self.indices_char = dict((i, c) for i, c in enumerate(self.chars))
        
    def encode(self, C, num_rows):
        """
        One-hot encode the given string C.

        # Parameters:
            C: The string to be encoded.
            num_rows: Number of rows in the returned one-hot encoding.
                      This keeps the number of rows the same
                      for every input.
        """
        x = np.zeros((num_rows, len(self.chars)))
        for i, c in enumerate(C):
            x[i, self.char_indices[c]] = 1
        return x
    
    def decode(self, x, calc_argmax=True):
        """
        Decode the given encoding back to a string.

        # Parameters:
            x: A 2D array of one-hot vectors or probabilities, or a vector
               of character indices (use calc_argmax=False in that case).
            calc_argmax: Whether to apply argmax to find the most likely
                         character index at each position.
        """
        if calc_argmax:
            x = x.argmax(axis=-1)
        return ''.join(self.indices_char[i] for i in x)
    
class colors:
    ok = '\033[92m'
    fail = '\033[91m'
    close = '\033[0m'
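
To make the encoding concrete, here is a minimal pure-Python sketch of the same one-hot round trip that `CharacterTable` performs. The names `one_hot_encode` and `one_hot_decode` are hypothetical helpers written for this illustration (they are not part of Keras or of the class above), and plain lists stand in for NumPy arrays.

```python
# Pure-Python sketch of one-hot encoding and decoding (no NumPy).
CHARS = sorted(set('0123456789+ '))          # same alphabet as the example
CHAR_TO_INDEX = {c: i for i, c in enumerate(CHARS)}
INDEX_TO_CHAR = {i: c for i, c in enumerate(CHARS)}


def one_hot_encode(text, num_rows):
    """Return num_rows rows; row i has a single 1 at the index of text[i]."""
    rows = [[0] * len(CHARS) for _ in range(num_rows)]
    for i, c in enumerate(text):
        rows[i][CHAR_TO_INDEX[c]] = 1
    return rows


def one_hot_decode(rows):
    """Take the argmax of each row and map it back to a character."""
    return ''.join(INDEX_TO_CHAR[max(range(len(row)), key=row.__getitem__)]
                   for row in rows)


encoded = one_hot_encode('12+345 ', 7)
print(len(encoded), len(encoded[0]))   # number of rows, alphabet size
print(repr(one_hot_decode(encoded)))   # the original padded string
```

Each row holds exactly one 1, so a 7-character question over a 12-character alphabet becomes a 7 x 12 grid, which is the per-sample shape the model consumes later on.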

## Parameters and training data generation



In [3]:
# Model and dataset parameters
TRAINING_SIZE = 50000  # number of addition questions to generate
DIGITS = 3             # maximum number of digits per operand
INVERT = True          # whether to reverse the question string

# Maximum length of a question 'int+int' (e.g., '345+678')
MAXLEN = DIGITS + 1 + DIGITS

# All characters used: the digits, the plus sign, and the space used for padding
chars = '0123456789+ '
# Create the CharacterTable instance
ctable = CharacterTable(chars) 

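# Training questions of the form 'xxx+yyy'
questions = [] 
# Expected answers (training labels)
expected = []  
# Operand pairs already generated, used to skip duplicates
seen = set()

print('Generating data...')  # generate the training data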

while len(questions) < TRAINING_SIZE:
    # Random integer generator: builds a number from 1 to DIGITS random digits
    f = lambda: int(''.join(np.random.choice(list('0123456789'))
                           for i in range(np.random.randint(1, DIGITS+1))))
    a, b = f(), f()
    
    # Skip questions already seen; a+b and b+a count as the same question,
    # which is why the key is the sorted pair
    key = tuple(sorted((a, b)))
    if key in seen:
        continue    
    seen.add(key)
    
    # Pad the question with spaces so it is exactly MAXLEN characters long
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))
    ans = str(a + b)
    
    # Answers are at most DIGITS + 1 characters long; pad shorter ones with spaces
    ans += ' ' * (DIGITS + 1 - len(ans))
    
    if INVERT:
        # Reverse the question string, e.g. '12+345 ' becomes ' 543+21' (note the padding space)
        query = query[::-1]
    questions.append(query)
    expected.append(ans)
    
print('Total addition questions:', len(questions))

Generating data...
Total addition questions: 50000
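
As a worked example of the formatting rules above, the sketch below pads and reverses a single question and counts how many distinct questions exist. `format_pair` is a hypothetical helper that mirrors the body of the generation loop; it is not part of the original script.

```python
DIGITS = 3                       # same values as the example
MAXLEN = DIGITS + 1 + DIGITS


def format_pair(a, b, invert=True):
    """Hypothetical helper: build the padded (query, answer) strings for a + b."""
    q = '{}+{}'.format(a, b)
    query = q + ' ' * (MAXLEN - len(q))          # pad the question to MAXLEN
    ans = str(a + b)
    ans += ' ' * (DIGITS + 1 - len(ans))         # pad the answer to DIGITS + 1
    if invert:
        query = query[::-1]                      # reverse the padded question
    return query, ans


query, ans = format_pair(12, 345)
print(repr(query), repr(ans))

# Operands range over 0..999, and a+b is treated the same as b+a,
# so the number of distinct questions is the count of unordered pairs.
n_values = 10 ** DIGITS
n_questions = n_values * (n_values + 1) // 2
print(n_questions)
```

Under these assumptions there are 500,500 distinct questions, so the 50,000 generated samples cover roughly a tenth of the problem space; the model has to generalize rather than memorize.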


## Preprocessing

In [4]:
# Convert the data to the structure an LSTM expects: [samples, timesteps, features]
print('Vectorization...')

# Allocate a 3D NumPy array for the features (questions)
x = np.zeros((len(questions), MAXLEN, len(chars)), dtype=bool) 
# Allocate a 3D NumPy array for the labels (answers)
y = np.zeros((len(questions), DIGITS + 1, len(chars)), dtype=bool) 

# One-hot encode each question into the feature array x
for i, sentence in enumerate(questions):
    x[i] = ctable.encode(sentence, MAXLEN)      

print("Feature data: ", x.shape)

# One-hot encode each answer into the label array y
for i, sentence in enumerate(expected):
    y[i] = ctable.encode(sentence, DIGITS + 1)  

print("Label data: ", y.shape)

# Shuffle x and y in unison, using one shared permutation of indices
indices = np.arange(len(y))
np.random.shuffle(indices)
x = x[indices]
y = y[indices]

# Hold out 10% of the data for validation
split_at = len(x) - len(x) // 10
(x_train, x_val) = x[:split_at], x[split_at:]
(y_train, y_val) = y[:split_at], y[split_at:]

print('Training Data:')
print(x_train.shape)
print(y_train.shape)

print('Validation Data:')
print(x_val.shape)
print(y_val.shape)

Vectorization...
Feature data:  (50000, 7, 12)
Label data:  (50000, 4, 12)
Training Data:
(45000, 7, 12)
(45000, 4, 12)
Validation Data:
(5000, 7, 12)
(5000, 4, 12)
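
The shuffle-then-split step can be sketched without NumPy as follows. `shuffle_and_split` is a hypothetical helper, and the 50 toy questions are made up for illustration; the point is that one shared permutation keeps every question aligned with its answer.

```python
import random


def shuffle_and_split(x, y, seed=None):
    """Shuffle x and y with one shared permutation, then hold out the last 10%."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    x = [x[i] for i in indices]
    y = [y[i] for i in indices]
    split_at = len(x) - len(x) // 10
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# Toy data: question 'n+1' paired with its answer
questions = ['{}+1'.format(n) for n in range(50)]
answers = [str(n + 1) for n in range(50)]
(x_train, y_train), (x_val, y_val) = shuffle_and_split(questions, answers, seed=0)
print(len(x_train), len(x_val))
```

With 50,000 samples the same arithmetic gives `split_at = 50000 - 50000 // 10 = 45000`, which matches the shapes printed above.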


## Build the network architecture

In [7]:
# Try replacing LSTM with another RNN layer, such as GRU or SimpleRNN
RNN = layers.LSTM
HIDDEN_SIZE = 128
BATCH_SIZE = 128
LAYERS = 1

print('Build model...')
model = Sequential()

# ===== encoder ====

# "Encode" the input sequence with an RNN, producing an output of size HIDDEN_SIZE.
# Note: if input sequences have variable length, use input_shape=(None, num_features).

# MAXLEN is the number of timesteps; len(chars) is the number of one-hot features
model.add(RNN(HIDDEN_SIZE, input_shape=(MAXLEN, len(chars)))) 

# As input to the decoder RNN, repeat the encoder's last output once per time step.
# Repeat DIGITS + 1 times because that is the maximum output length: for example,
# when DIGITS = 3, the largest sum is 999 + 999 = 1998 (length 4).
model.add(layers.RepeatVector(DIGITS+1))


# ==== decoder ====
# The decoder RNN can be a single layer or a stack of layers.
for _ in range(LAYERS):
    # With return_sequences=True, the layer returns its output at every time step,
    # with shape (num_samples, timesteps, output_dim), rather than only the last one.
    # This is necessary because the TimeDistributed layer below expects the first
    # dimension (after the batch dimension) to be the time steps.
    model.add(RNN(HIDDEN_SIZE, return_sequences=True))

# Apply a dense layer to every time slice of the input. For each step of the
# output sequence, it decides which character should be chosen.
model.add(layers.TimeDistributed(layers.Dense(len(chars))))

model.add(layers.Activation('softmax'))
model.compile(loss='categorical_crossentropy',
             optimizer='adam',
             metrics=['accuracy'])

model.summary()

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 128)               72192     
_________________________________________________________________
repeat_vector_3 (RepeatVecto (None, 4, 128)            0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 4, 128)            131584    
_________________________________________________________________
time_distributed_3 (TimeDist (None, 4, 12)             1548      
_________________________________________________________________
activation_2 (Activation)    (None, 4, 12)             0         
Total params: 205,324
Trainable params: 205,324
Non-trainable params: 0
_________________________________________________________________
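
The parameter counts in the summary can be checked by hand. The sketch below uses the standard LSTM and dense-layer parameter formulas; `lstm_params` and `dense_params` are hypothetical helper names introduced only for this check.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


NUM_CHARS = 12      # len(chars) in the example
HIDDEN_SIZE = 128

encoder = lstm_params(NUM_CHARS, HIDDEN_SIZE)     # first LSTM reads one-hot characters
decoder = lstm_params(HIDDEN_SIZE, HIDDEN_SIZE)   # second LSTM reads the repeated vector
output = dense_params(HIDDEN_SIZE, NUM_CHARS)     # dense layer shared across time steps
print(encoder, decoder, output, encoder + decoder + output)
```

The results (72192, 131584, 1548, total 205324) agree with the `Param #` column. `RepeatVector` and `Activation` have no weights, and `TimeDistributed` reuses one dense layer at every time step, so its count does not depend on the output length.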


## Train the model and evaluate on the validation data

In [8]:
# Train one epoch per iteration, printing sample predictions after each one
for iteration in range(1, 30):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    model.fit(x_train, y_train,
             batch_size=BATCH_SIZE,
             epochs=1,
             validation_data=(x_val, y_val))
    
    # Pick 10 random validation samples and display the model's predictions
    for i in range(10):
        ind = np.random.randint(0, len(x_val))
        rowx, rowy = x_val[np.array([ind])], y_val[np.array([ind])]
        # Most likely character index at each time step (equivalent to the older
        # predict_classes, which newer Keras releases no longer provide)
        preds = np.argmax(model.predict(rowx, verbose=0), axis=-1)
        
        q = ctable.decode(rowx[0])
        correct = ctable.decode(rowy[0])
        guess = ctable.decode(preds[0], calc_argmax=False)
        print('Q', q[::-1] if INVERT else q, end=' ')
        print('T', correct, end=' ')
        if correct == guess:
            print(colors.ok + '☑' + colors.close, end=' ')
        else:
            print(colors.fail + '☒' + colors.close, end=' ')
        print(guess)


--------------------------------------------------
Iteration 1
Train on 45000 samples, validate on 5000 samples
Epoch 1/1
Q 623+449 T 1072 ☒ 103
Q 959+26  T 985  ☒ 109
Q 518+41  T 559  ☒ 101
Q 34+850  T 884  ☒ 108
Q 819+40  T 859  ☒ 109
Q 321+372 T 693  ☒ 108
Q 52+222  T 274  ☒ 111
Q 98+861  T 959  ☒ 109
Q 116+36  T 152  ☒ 111
Q 78+264  T 342  ☒ 107

--------------------------------------------------
Iteration 2
Train on 45000 samples, validate on 5000 samples
Epoch 1/1
Q 960+999 T 1959 ☒ 1610
Q 812+116 T 928  ☒ 102
Q 591+171 T 762  ☒ 102
Q 74+180  T 254  ☒ 177
Q 164+33  T 197  ☒ 276
Q 40+607  T 647  ☒ 576
Q 29+856  T 885  ☒ 696
Q 581+57  T 638  ☒ 555
Q 898+82  T 980  ☒ 902
Q 25+569  T 594  ☒ 556

--------------------------------------------------
Iteration 3
Train on 45000 samples, valida

Q 779+3   T 782  ☑ 782
Q 28+203  T 231  ☒ 230
Q 144+554 T 698  ☑ 698
Q 89+71   T 160  ☒ 150
Q 512+68  T 580  ☑ 580
Q 413+61  T 474  ☑ 474
Q 599+0   T 599  ☑ 599
Q 30+892  T 922  ☑ 922
Q 61+725  T 786  ☑ 786
Q 954+6   T 960  ☑ 960

--------------------------------------------------
Iteration 16
Train on 45000 samples, validate on 5000 samples
Epoch 1/1
Q 883+796 T 1679 ☑ 1679
Q 603+39  T 642  ☑ 642
Q 246+4   T 250  ☑ 250
Q 256+66  T 322  ☑ 322
Q 483+72  T 555  ☑ 555
Q 509+65  T 574  ☑ 574
Q 669+683 T 1352 ☑ 1352
Q 36+668  T 704  ☑ 704
Q 769+54  T 823  ☑ 823
Q 988+71  T 1059 ☑ 1059

--------------------------------------------------
Iteration 17
Train on 45000 samples, validate on 5000 samples
Epoch 1/1
Q 197+947 T 1144 ☒ 1143
Q 59+0    T 59   ☑ 59
Q 73+38   T 111  ☑

Q 64+427  T 491  ☑ 491
Q 594+262 T 856  ☑ 856
Q 349+74  T 423  ☑ 423
Q 168+85  T 253  ☑ 253
Q 82+161  T 243  ☑ 243
Q 962+78  T 1040 ☑ 1040
Q 3+459   T 462  ☑ 462
Q 26+449  T 475  ☑ 475
Q 15+893  T 908  ☑ 908
Q 461+96  T 557  ☑ 557
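
When reading these logs, keep in mind that the ☑ / ☒ marks require the whole answer to match, whereas the `accuracy` metric compiled into the model should be computed per output character, so it will typically look more optimistic. The sketch below contrasts the two on a few (expected, predicted) pairs copied from the sample output above; both function names are hypothetical and written for this illustration.

```python
ANSWER_LEN = 4   # DIGITS + 1 in the example

# (expected, predicted) pairs copied from the sample output above
pairs = [('1144', '1143'), ('59', '59'), ('231', '230'),
         ('160', '150'), ('782', '782')]


def pad(s):
    """Right-pad an answer with spaces, as the data generator does."""
    return s.ljust(ANSWER_LEN)


def exact_match_accuracy(pairs):
    """Fraction of answers where every character is right (the check marks)."""
    return sum(pad(t) == pad(g) for t, g in pairs) / len(pairs)


def per_character_accuracy(pairs):
    """Fraction of character positions that are right, padding included."""
    hits = sum(tc == gc for t, g in pairs for tc, gc in zip(pad(t), pad(g)))
    return hits / (ANSWER_LEN * len(pairs))


print(exact_match_accuracy(pairs))      # 2 of 5 answers fully correct
print(per_character_accuracy(pairs))    # 17 of 20 characters correct
```

On these five rows the exact-match rate is 0.4 while the per-character rate is 0.85, because each wrong answer is off by a single digit.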


One caveat of the approach above is its underlying assumption: that a fixed-length input sequence yields a fixed-length target sequence, so that `target[...t]` can be generated given `input[...t]`.

This works in some cases (such as adding strings of digits), but not in most use cases.