# Translation of Numeric Phrases with Seq2Seq

In the following we will try to build a translation model from French phrases describing numbers to the corresponding representation in digits (base 10).

The parallel text data is generated from a "ground-truth" Python function named `to_french_phrase` that captures common rules of the French language, except hyphenation, which is left out to make the French strings more ambiguous:

In [None]:
from french_numbers import to_french_phrase


for x in [21, 80, 81, 300, 213, 1100, 1201, 301000, 80080]:
    print(str(x).rjust(6), to_french_phrase(x))

## Generating a Training Set

The following will generate 20000 example phrases for numbers between 1 and 1,000,000 (excluded). It over-represents small numbers by generating all the possible short sequences between 1 and `exhaustive`.

Let's split the generated set into non-overlapping train, validation and test splits.

In [None]:
from french_numbers import generate_translations
from sklearn.model_selection import train_test_split


numbers, french_numbers = generate_translations(
    low=1, high=int(1e6) - 1, exhaustive=5000, random_seed=0)
num_train, num_dev, fr_train, fr_dev = train_test_split(
    numbers, french_numbers, test_size=0.5, random_state=0)

num_val, num_test, fr_val, fr_test = train_test_split(
    num_dev, fr_dev, test_size=0.5, random_state=0)
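Since `test_size=0.5` is applied twice, roughly half of the pairs end up in the training set and about a quarter each in the validation and test sets. The following is a minimal pure-Python sketch of such a two-stage split on a toy list. The helper `split_half` is hypothetical and only illustrates the idea; it is not how scikit-learn implements `train_test_split`.

```python
import random


def split_half(items, seed=0):
    """Shuffle a copy of `items` and cut it into two halves (hypothetical helper)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    middle = len(items) // 2
    return items[:middle], items[middle:]


examples = list(range(1, 21))    # toy stand-in for the generated pairs
train, dev = split_half(examples)
val, test = split_half(dev)

print(len(train), len(val), len(test))  # 10 5 5

# The three splits never share an example.
overlap = (set(train) & set(val)) | (set(train) & set(test)) | (set(val) & set(test))
print(len(overlap))  # 0
```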

In [None]:
len(fr_train), len(fr_val), len(fr_test)

In [None]:
for i, fr_phrase, num_phrase in zip(range(5), fr_train, num_train):
    print(num_phrase.rjust(6), fr_phrase)

In [None]:
for i, fr_phrase, num_phrase in zip(range(5), fr_val, num_val):
    print(num_phrase.rjust(6), fr_phrase)

## Vocabularies

We build the vocabularies from the training set only, so that the validation and test sets may contain some out-of-vocabulary words.

First we need to introduce specific symbols that will be used to:
- pad sequences
- mark the beginning of translation
- mark the end of translation
- be used as a placeholder for out-of-vocabulary symbols (not seen in the training set).

Here we use the same convention as the [TensorFlow seq2seq tutorial](https://www.tensorflow.org/tutorials/seq2seq):

In [None]:
PAD, GO, EOS, UNK = START_VOCAB = ['_PAD', '_GO', '_EOS', '_UNK']

To build the vocabulary we need to tokenize the sequences of symbols. For the numeric representation we use character-level tokenization, while whitespace-based word-level tokenization will do for the French phrases:

In [None]:
def tokenize(sentence, word_level=True):
    if word_level:
        return sentence.split()
    else:
        return list(sentence)

In [None]:
tokenize('1234', word_level=False)

In [None]:
tokenize('mille deux cent trente quatre', word_level=True)

Let's now use this tokenization strategy to assign a unique integer token id to each possible token string found in the training set in each language ('French' and 'numeric'):

In [None]:
def build_vocabulary(tokenized_sequences):
    rev_vocabulary = START_VOCAB[:]
    unique_tokens = set()
    for tokens in tokenized_sequences:
        unique_tokens.update(tokens)
    rev_vocabulary += sorted(unique_tokens)
    vocabulary = {token: i for i, token in enumerate(rev_vocabulary)}
    return vocabulary, rev_vocabulary

In [None]:
tokenized_fr_train = [tokenize(s, word_level=True) for s in fr_train]
tokenized_num_train = [tokenize(s, word_level=False) for s in num_train]

fr_vocab, rev_fr_vocab = build_vocabulary(tokenized_fr_train)
num_vocab, rev_num_vocab = build_vocabulary(tokenized_num_train)
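To see what `build_vocabulary` returns, here is a small sketch on a toy corpus. The two phrases are made up for illustration, and the function is restated so that the snippet runs on its own. The reserved symbols always receive the first four ids, followed by the remaining tokens in sorted order.

```python
START_VOCAB = ['_PAD', '_GO', '_EOS', '_UNK']


def build_vocabulary(tokenized_sequences):
    rev_vocabulary = START_VOCAB[:]
    unique_tokens = set()
    for tokens in tokenized_sequences:
        unique_tokens.update(tokens)
    rev_vocabulary += sorted(unique_tokens)
    vocabulary = {token: i for i, token in enumerate(rev_vocabulary)}
    return vocabulary, rev_vocabulary


# Toy corpus, for illustration only.
toy_corpus = [['vingt', 'et', 'un'], ['cent', 'un']]
toy_vocab, rev_toy_vocab = build_vocabulary(toy_corpus)

print(rev_toy_vocab)
# ['_PAD', '_GO', '_EOS', '_UNK', 'cent', 'et', 'un', 'vingt']
print(toy_vocab['_PAD'], toy_vocab['un'])
# 0 6
```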

The two languages do not have the same vocabulary sizes:

In [None]:
len(fr_vocab)

In [None]:
len(num_vocab)

In [None]:
for k, v in sorted(fr_vocab.items())[:10]:
    print(k.rjust(10), v)
print('...')

In [None]:
for k, v in sorted(num_vocab.items()):
    print(k.rjust(10), v)

We also built the reverse mappings from token ids to token string representations:

In [None]:
print(rev_fr_vocab)

In [None]:
print(rev_num_vocab)
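The two mappings are used in opposite directions: the vocabulary turns tokens into ids, and the reverse vocabulary turns ids back into tokens. The sketch below assumes a small made-up vocabulary and two hypothetical helpers, `encode` and `decode`; it also shows that a token missing from the vocabulary should map to the integer id of `_UNK`.

```python
# Toy vocabulary for illustration only; real ids come from build_vocabulary.
rev_toy_vocab = ['_PAD', '_GO', '_EOS', '_UNK', 'cent', 'deux', 'un']
toy_vocab = {token: i for i, token in enumerate(rev_toy_vocab)}


def encode(tokens, vocab):
    """Map token strings to ids, using the id of _UNK for unseen tokens."""
    unk_id = vocab['_UNK']
    return [vocab.get(t, unk_id) for t in tokens]


def decode(token_ids, rev_vocab):
    """Map ids back to token strings with the reverse vocabulary."""
    return [rev_vocab[i] for i in token_ids]


# 'septante' is not in the toy vocabulary, so it becomes _UNK.
ids = encode(['deux', 'cent', 'septante'], toy_vocab)
print(ids)                         # [5, 4, 3]
print(decode(ids, rev_toy_vocab))  # ['deux', 'cent', '_UNK']
```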

## Seq2Seq with a single GRU architecture

<img src="images/basic_seq2seq.png" width="80%" />

From: [Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS 2014](https://arxiv.org/abs/1409.3215)



For a given source sequence - target sequence pair, we will:
- tokenize the source and target sequences;
- reverse the order of the source sequence;
- build the input sequence by concatenating the reversed source sequence and the target sequence in original order using the `_GO` token as a delimiter;
- build the output sequence by appending the `_EOS` token to the target sequence.


Let's do this as a function using the original string representations for the tokens so as to make it easier to debug:

In [None]:
def make_input_output(source_tokens, target_tokens, reverse_source=True):
    if reverse_source:
        source_tokens = source_tokens[::-1]
    input_tokens = source_tokens + [GO] + target_tokens
    output_tokens = target_tokens + [EOS]
    return input_tokens, output_tokens

In [None]:
input_tokens, output_tokens = make_input_output(
    ['cent', 'vingt', 'et', 'un'],
    ['1', '2', '1'],
)

In [None]:
input_tokens

In [None]:
output_tokens
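For reference, the example above can be reproduced in a standalone snippet. `make_input_output` is restated so that the code runs on its own. The final loop is an added illustration of how the last positions of the input line up with the output once both are padded on the left to the same length: each input position is paired with the next target token.

```python
GO, EOS = '_GO', '_EOS'


def make_input_output(source_tokens, target_tokens, reverse_source=True):
    if reverse_source:
        source_tokens = source_tokens[::-1]
    input_tokens = source_tokens + [GO] + target_tokens
    output_tokens = target_tokens + [EOS]
    return input_tokens, output_tokens


input_tokens, output_tokens = make_input_output(
    ['cent', 'vingt', 'et', 'un'], ['1', '2', '1'])
print(input_tokens)
# ['un', 'et', 'vingt', 'cent', '_GO', '1', '2', '1']
print(output_tokens)
# ['1', '2', '1', '_EOS']

# Pair the last len(output_tokens) input positions with the output tokens.
aligned = list(zip(input_tokens[-len(output_tokens):], output_tokens))
for in_token, out_token in aligned:
    print(in_token.rjust(4), '->', out_token)
```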

### Vectorization of the parallel corpus

Let's apply the previous transformation to each (source, target) sequence pair and use a shared vocabulary to store the results in numpy arrays of integer token ids, with padding on the left so that all input / output sequences have the same length:

In [None]:
all_tokenized_sequences = tokenized_fr_train + tokenized_num_train
shared_vocab, rev_shared_vocab = build_vocabulary(all_tokenized_sequences)

In [None]:
import numpy as np
max_length = 20  # found by inspection of our training set

def vectorize_corpus(source_sequences, target_sequences, shared_vocab,
                     word_level_source=True, word_level_target=True,
                     max_length=max_length):
    assert len(source_sequences) == len(target_sequences)
    n_sequences = len(source_sequences)
    source_ids = np.empty(shape=(n_sequences, max_length), dtype=np.int32)
    source_ids.fill(shared_vocab[PAD])
    target_ids = np.empty(shape=(n_sequences, max_length), dtype=np.int32)
    target_ids.fill(shared_vocab[PAD])
    numbered_pairs = zip(range(n_sequences), source_sequences, target_sequences)
    for i, source_seq, target_seq in numbered_pairs:
        source_tokens = tokenize(source_seq, word_level=word_level_source)
        target_tokens = tokenize(target_seq, word_level=word_level_target)
        
        in_tokens, out_tokens = make_input_output(source_tokens, target_tokens)
        
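        # Fall back to the integer id of _UNK (not the string) for unseen tokens
        unk_id = shared_vocab[UNK]
        in_token_ids = [shared_vocab.get(t, unk_id) for t in in_tokens]
        source_ids[i, -len(in_token_ids):] = in_token_ids

        out_token_ids = [shared_vocab.get(t, unk_id) for t in out_tokens]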
        target_ids[i, -len(out_token_ids):] = out_token_ids
    return source_ids, target_ids
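The value of `max_length` has to be at least as large as the longest input sequence, that is the source words plus `_GO` plus the target characters; otherwise the slice assignment in `vectorize_corpus` raises an error. One possible way to check this is sketched below on a few made-up pairs. The helper `required_length` is hypothetical and assumes word-level French and character-level numbers, as in this notebook.

```python
def required_length(source_seq, target_seq):
    """Length of the input sequence: source words + _GO + target characters."""
    return len(source_seq.split()) + 1 + len(target_seq)


# Toy pairs, for illustration only.
toy_pairs = [
    ('cent vingt et un', '121'),
    ('quatre mille deux cent deux', '4202'),
    ('onze', '11'),
]
lengths = [required_length(fr, num) for fr, num in toy_pairs]
print(lengths)       # [8, 10, 4]
print(max(lengths))  # 10
```

Running the same computation over the real training pairs would give the smallest admissible `max_length` for this corpus.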

In [None]:
X_train, Y_train = vectorize_corpus(fr_train, num_train, shared_vocab,
                                    word_level_target=False)

In [None]:
X_train.shape

In [None]:
Y_train.shape

In [None]:
fr_train[0]

In [None]:
num_train[0]

In [None]:
X_train[0]

In [None]:
Y_train[0]

This looks good. In particular we can note:

- the PAD=0 symbol at the beginning of the two sequences,
- the input sequence has the GO=1 symbol to separate the source from the target,
- the output sequence is a shifted version of the target and ends with EOS=2.
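These observations can also be checked programmatically by turning a vectorized input row back into tokens. The sketch below uses a made-up vocabulary and row; `describe_input_row` is a hypothetical helper, not part of the notebook.

```python
# Toy reverse vocabulary and ids, for illustration only.
rev_toy_vocab = ['_PAD', '_GO', '_EOS', '_UNK',
                 '1', '2', 'cent', 'et', 'un', 'vingt']
PAD_ID, GO_ID = 0, 1


def describe_input_row(row, rev_vocab):
    """Split a left-padded input row into (source tokens, target tokens)."""
    ids = [i for i in row if i != PAD_ID]        # drop the left padding
    go_position = ids.index(GO_ID)               # _GO separates source / target
    source = [rev_vocab[i] for i in ids[:go_position]][::-1]  # undo the reversal
    target = [rev_vocab[i] for i in ids[go_position + 1:]]
    return source, target


row = [0, 0, 0, 0, 8, 7, 9, 6, 1, 4, 5, 4]
decoded_row = describe_input_row(row, rev_toy_vocab)
print(decoded_row)
# (['cent', 'vingt', 'et', 'un'], ['1', '2', '1'])
```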

Let's vectorize the validation and test set to be able to evaluate our models:

In [None]:
X_val, Y_val = vectorize_corpus(fr_val, num_val, shared_vocab,
                                word_level_target=False)
X_test, Y_test = vectorize_corpus(fr_test, num_test, shared_vocab,
                                  word_level_target=False)

In [None]:
X_val.shape, Y_val.shape

In [None]:
X_test.shape, Y_test.shape

### A simple homogeneous Seq2Seq architecture


To keep the architecture simple we will use the same RNN architecture and weights for the encoder part (before the `_GO` token) and the decoder part (after the `_GO` token).

Here we use the GRU recurrent cell instead of LSTM because it is slightly faster to compute and should give comparable results.
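One rough way to see why a GRU is cheaper is to count parameters. The sketch below assumes the textbook formulations, in which an LSTM has four weight blocks and a GRU has three, each with input weights, recurrent weights and one bias vector. The numbers Keras reports may differ depending on the version and layer options, so this is only an approximation.

```python
def recurrent_params(n_blocks, input_dim, units):
    """Parameters for `n_blocks` gate blocks, each with
    input weights, recurrent weights and one bias vector."""
    return n_blocks * (units * (input_dim + units) + units)


input_dim, units = 32, 128  # embedding size and GRU size used in this notebook

lstm_params = recurrent_params(4, input_dim, units)
gru_params = recurrent_params(3, input_dim, units)
print(lstm_params, gru_params)   # 82432 61824
print(gru_params / lstm_params)  # 0.75
```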

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Dropout, GRU, Dense

vocab_size = len(shared_vocab)
simple_seq2seq = Sequential()
simple_seq2seq.add(Embedding(vocab_size, 32, input_length=max_length))
simple_seq2seq.add(Dropout(0.2))
simple_seq2seq.add(GRU(128, return_sequences=True))
simple_seq2seq.add(Dense(vocab_size, activation='softmax'))

# Here we use the sparse_categorical_crossentropy loss to be able to pass
# integer-coded token ids as targets without having to one-hot encode them.
simple_seq2seq.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
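The sparse variant of the loss computes the same quantity as the regular categorical cross-entropy; it only differs in how the target is given. The following pure-Python sketch illustrates this for a single position, with made-up probabilities. It is a conceptual illustration, not the Keras implementation.

```python
import math

probabilities = [0.1, 0.7, 0.2]  # made-up softmax output over 3 tokens
target_id = 1                    # integer-coded target token id

# Sparse form: look up the probability of the target id directly.
sparse_loss = -math.log(probabilities[target_id])

# Dense form: build the one-hot vector first, then sum over all classes.
one_hot = [1.0 if i == target_id else 0.0 for i in range(len(probabilities))]
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot, probabilities))

print(math.isclose(sparse_loss, dense_loss))  # True
```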

In [None]:
simple_seq2seq.fit(X_train, Y_train[:, :, np.newaxis],
                   validation_data=(X_val, Y_val[:, :, np.newaxis]),
                   nb_epoch=1000, verbose=2)

Let's have a look at a raw prediction on the test set:

In [None]:
fr_test[0]

In [None]:
X_test[0]

In [None]:
prediction = simple_seq2seq.predict(X_test[0:1])[0]

In [None]:
prediction.shape

In [None]:
prediction.argmax(-1)

In [None]:
num_test[0]
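`prediction` holds one row of token probabilities per position, and `argmax(-1)` picks the most likely token id at each position. Below is a NumPy-free sketch of the same operation on made-up probabilities, followed by a lookup of the corresponding tokens. The toy vocabulary, the values and the `argmax` helper are illustrative assumptions.

```python
rev_toy_vocab = ['_PAD', '_GO', '_EOS', '_UNK', '1', '2']

# One row per position, one column per token id (made-up values).
toy_prediction = [
    [0.90, 0.02, 0.02, 0.02, 0.02, 0.02],  # padding position
    [0.05, 0.05, 0.05, 0.05, 0.70, 0.10],  # most likely: '1'
    [0.05, 0.05, 0.05, 0.05, 0.20, 0.60],  # most likely: '2'
    [0.05, 0.05, 0.70, 0.05, 0.10, 0.05],  # most likely: '_EOS'
]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


predicted_ids = [argmax(row) for row in toy_prediction]
predicted_tokens = [rev_toy_vocab[i] for i in predicted_ids]
print(predicted_ids)     # [0, 4, 5, 2]
print(predicted_tokens)  # ['_PAD', '1', '2', '_EOS']
```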

In [None]:
def translate(model, source_sequence, shared_vocab, rev_shared_vocab,
              word_level_source=True, word_level_target=True):
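    # Use the model passed as argument rather than the global variable
    max_length = model.input_shape[1]
    source_tokens = tokenize(source_sequence, word_level=word_level_source)
    # Unseen tokens map to the integer id of _UNK (not the string itself)
    unk_id = shared_vocab[UNK]
    input_ids = [shared_vocab.get(t, unk_id) for t in source_tokens[::-1]]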
    input_ids += [shared_vocab[GO]]
    decoded = []
    while len(input_ids) <= max_length:
        X = np.zeros(shape=(1, max_length), dtype=np.int32)
        X[0, -len(input_ids):] = input_ids
        next_id = model.predict(X)[0, -1].argmax()
        if next_id == shared_vocab[EOS]:
            break
        decoded.append(rev_shared_vocab[next_id])
        input_ids.append(next_id)
    return " ".join(decoded) if word_level_target else "".join(decoded)
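The loop in `translate` is a greedy decoder: at each step it keeps the single most likely next token, appends it to the input, and stops at `_EOS` or when the maximum length is reached. The sketch below isolates that loop from Keras by replacing the model with a scripted stand-in. `greedy_decode` and `scripted_predict_next` are hypothetical names, and the token ids are made up.

```python
def greedy_decode(predict_next, input_ids, eos_id, max_length):
    """Repeatedly ask for the next token id and feed it back into the input."""
    input_ids = list(input_ids)
    decoded = []
    while len(input_ids) <= max_length:
        next_id = predict_next(input_ids)
        if next_id == eos_id:
            break
        decoded.append(next_id)
        input_ids.append(next_id)
    return decoded


GO_ID, EOS_ID = 1, 2
scripted_output = [7, 8, 7, EOS_ID]  # what the stand-in "model" will emit


def scripted_predict_next(input_ids):
    # The number of tokens emitted so far is the number of ids after _GO.
    n_emitted = len(input_ids) - input_ids.index(GO_ID) - 1
    return scripted_output[n_emitted]


decoded_ids = greedy_decode(scripted_predict_next, [12, 11, 10, GO_ID],
                            eos_id=EOS_ID, max_length=20)
print(decoded_ids)  # [7, 8, 7]
```

With `max_length` smaller than the scripted output would need, the loop simply stops early instead of waiting for `_EOS`, which mirrors the behaviour of `translate`.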

In [None]:
translate(simple_seq2seq, "onze", shared_vocab, rev_shared_vocab,
          word_level_target=False)

In [None]:
translate(simple_seq2seq, "cent trente deux", shared_vocab, rev_shared_vocab,
          word_level_target=False)

In [None]:
translate(simple_seq2seq, "mille cent trente deux", shared_vocab, rev_shared_vocab,
          word_level_target=False)

In [None]:
translate(simple_seq2seq, "quatre mille cent trente deux", shared_vocab, rev_shared_vocab,
          word_level_target=False)

In [None]:
translate(simple_seq2seq, "quatre mille deux cent deux", shared_vocab, rev_shared_vocab,
          word_level_target=False)

In [None]:
translate(simple_seq2seq, "vingt quatre", shared_vocab, rev_shared_vocab,
          word_level_target=False)

In [None]:
translate(simple_seq2seq, "treize mille deux cent trois", shared_vocab, rev_shared_vocab,
          word_level_target=False)