# Translation of Numeric Phrases with Seq2Seq

In the following we will try to build a translation model that maps French phrases describing numbers to their corresponding base-10 digit representation.

The parallel text data is generated from a "ground-truth" Python function named `to_french_phrase` that captures common rules of the French language, except hyphenation, which is left out to make the French strings more ambiguous:

In [None]:
from french_numbers import to_french_phrase


for x in [21, 80, 81, 300, 213, 1100, 1201, 301000, 80080]:
    print(str(x).rjust(6), to_french_phrase(x))

## Generating a Training Set

The following will generate 20000 example phrases for numbers between 1 and 1,000,000 (excluded). It over-represents small numbers by generating all the possible short sequences between 1 and `exhaustive`.

Let's split the generated set into non-overlapping train, validation and test splits.

In [None]:
from french_numbers import generate_translations
from sklearn.model_selection import train_test_split


numbers, french_numbers = generate_translations(
    low=1, high=int(1e6) - 1, exhaustive=5000, random_seed=0)
num_train, num_dev, fr_train, fr_dev = train_test_split(
    numbers, french_numbers, test_size=0.5, random_state=0)

num_val, num_test, fr_val, fr_test = train_test_split(
    num_dev, fr_dev, test_size=0.5, random_state=0)
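
The two-stage split above keeps half of the data for training and divides the other half equally between validation and test. As a rough illustration of these proportions, here is a minimal sketch on toy data that uses only the standard library. The helper `split_in_half` is a hypothetical stand-in for `train_test_split(..., test_size=0.5)` and is not part of scikit-learn or `french_numbers`; the `toy_` names are made up for this example.

```python
import random


def split_in_half(items, seed=0):
    """Hypothetical stand-in for train_test_split(..., test_size=0.5)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


# 100 unique toy examples, stored as strings like the generated numbers.
toy_numbers = [str(n) for n in range(1, 101)]
toy_train, toy_dev = split_in_half(toy_numbers)
toy_val, toy_test = split_in_half(toy_dev)

print(len(toy_train), len(toy_val), len(toy_test))  # 50 25 25

# The three subsets share no example and together cover the toy set.
assert not set(toy_train) & set(toy_val)
assert not set(toy_train) & set(toy_test)
assert not set(toy_val) & set(toy_test)
assert set(toy_train) | set(toy_val) | set(toy_test) == set(toy_numbers)
```

The disjointness check only holds as written because the toy examples are unique.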

In [None]:
len(fr_train), len(fr_val), len(fr_test)

In [None]:
for i, fr_phrase, num_phrase in zip(range(5), fr_train, num_train):
    print(num_phrase.rjust(6), fr_phrase)

In [None]:
for i, fr_phrase, num_phrase in zip(range(5), fr_val, num_val):
    print(num_phrase.rjust(6), fr_phrase)

## Vocabularies

Build the vocabularies from the training set only, so that the validation and test sets have a chance to contain some out-of-vocabulary words.

First we need to introduce special symbols that will:
- pad sequences
- mark the beginning of translation
- mark the end of translation
- serve as a placeholder for out-of-vocabulary symbols (not seen in the training set).

Here we use the same convention as the [TensorFlow seq2seq tutorial](https://www.tensorflow.org/tutorials/seq2seq):

In [None]:
START_VOCAB = ['_PAD', '_GO', '_EOS', '_UNK']
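
To make the role of the four symbols more concrete, here is a minimal sketch of how they might be applied to token sequences. The helper names `frame_target` and `replace_unknown` are hypothetical. Padding on the right and framing only the target sequence are assumptions made for this illustration, not necessarily the layout used later for the model.

```python
TOY_START_VOCAB = ['_PAD', '_GO', '_EOS', '_UNK']
PAD, GO, EOS, UNK = TOY_START_VOCAB


def frame_target(tokens, max_length):
    """Wrap tokens with _GO / _EOS and right-pad with _PAD (assumed layout).

    Sequences longer than max_length are left untouched (no truncation).
    """
    framed = [GO] + list(tokens) + [EOS]
    return framed + [PAD] * (max_length - len(framed))


def replace_unknown(tokens, known_tokens):
    """Replace any token that is not in known_tokens by _UNK."""
    return [token if token in known_tokens else UNK for token in tokens]


framed_target = frame_target(list('21'), max_length=6)
print(framed_target)
# ['_GO', '2', '1', '_EOS', '_PAD', '_PAD']

masked_source = replace_unknown('vingt et onze'.split(), {'vingt', 'et', 'un'})
print(masked_source)
# ['vingt', 'et', '_UNK']
```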

To build the vocabulary we need to tokenize the sequences of symbols. For the digit representation of numbers we use character-level tokenization, while whitespace-based word-level tokenization will do for the French phrases:

In [None]:
def tokenize(sentence, word_level=True):
    if word_level:
        # Split on whitespace: one token per word.
        return sentence.split()
    else:
        # One token per character.
        return list(sentence)

In [None]:
tokenize('1234', word_level=False)

In [None]:
tokenize('mille deux cent trente quatre', word_level=True)

Let's now use this tokenization strategy to assign a unique integer token id to each possible token string found in the training set in each language ('French' and 'numeric'):

In [None]:
def build_vocabulary(sentences, word_level=True):
    # Special symbols come first so that they get ids 0 to 3.
    rev_vocabulary = START_VOCAB[:]
    unique_tokens = set()
    for sentence in sentences:
        tokens = tokenize(sentence, word_level=word_level)
        unique_tokens.update(tokens)
    # Sort the tokens to make the id assignment deterministic.
    rev_vocabulary += sorted(unique_tokens)
    # Map each token string to its position in rev_vocabulary.
    vocabulary = {token: i for i, token in enumerate(rev_vocabulary)}
    return vocabulary, rev_vocabulary

In [None]:
fr_vocab, rev_fr_vocab = build_vocabulary(fr_train, word_level=True)
num_vocab, rev_num_vocab = build_vocabulary(num_train, word_level=False)

The two languages do not have the same vocabulary sizes:

In [None]:
len(fr_vocab)

In [None]:
len(num_vocab)

In [None]:
for k, v in sorted(fr_vocab.items())[:10]:
    print(k.rjust(10), v)
print('...')

In [None]:
for k, v in sorted(num_vocab.items()):
    print(k.rjust(10), v)

In [None]:
print(rev_fr_vocab)

In [None]:
print(rev_num_vocab)
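
The vocabulary dictionary maps token strings to integer ids, while the reverse vocabulary list maps ids back to token strings. The following minimal sketch shows how such a pair could be used to encode and decode a sequence, with tokens never seen at vocabulary-building time falling back to `_UNK`. The helpers `encode` and `decode` and the three-word toy vocabulary are hypothetical and only serve this illustration.

```python
# Toy vocabulary with the same layout as above: special symbols first,
# then the sorted unique tokens.
toy_rev_vocab = ['_PAD', '_GO', '_EOS', '_UNK'] + sorted({'cent', 'deux', 'mille'})
toy_vocab = {token: i for i, token in enumerate(toy_rev_vocab)}


def encode(tokens, vocab):
    """Map token strings to ids, using the _UNK id for unknown tokens."""
    return [vocab.get(token, vocab['_UNK']) for token in tokens]


def decode(token_ids, rev_vocab):
    """Map ids back to token strings."""
    return [rev_vocab[token_id] for token_id in token_ids]


toy_ids = encode('mille deux cent trois'.split(), toy_vocab)
print(toy_ids)
# [6, 5, 4, 3]

toy_tokens = decode(toy_ids, toy_rev_vocab)
print(toy_tokens)
# ['mille', 'deux', 'cent', '_UNK']
```

Note that the round trip is lossy for out-of-vocabulary tokens: 'trois' is not in the toy vocabulary, so it comes back as `_UNK`.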