# COMM7370 AI Theories and Applications
# Tutorial: Machine Translation by Keras
## The Problem: Chinese-English translation
In this tutorial, we'll build a simple model that translates English sentences into Chinese.
### Seq2Seq model
A simple Seq2Seq model consists of three parts: an encoder LSTM, a decoder LSTM, and a context vector. Given the input sequence ABC, the encoder LSTM processes it and returns the hidden state at the last time step, which summarizes the entire input sequence and is also known as the context (C). The decoder LSTM then predicts the target sequence step by step based on this hidden state, producing the final output sequence wxyz. It is worth mentioning that Sutskever et al. designed their Seq2Seq model for a specific task: the input sequence is processed in reverse order, which helps the model handle long sentences and improves accuracy.
<img src="s2s.jpg" alt="drawing" width="500"/>
### Actual model
<img src="actural.jpg" alt="drawing" width="500"/>

The image above shows the actual model designed by Sutskever et al. in [this paper](https://arxiv.org/pdf/1409.3215.pdf).

The input to the encoder LSTM is the sentence in the original language; the input to the decoder LSTM is the sentence in the translated language with a `start-of-sentence` token. The output is the actual target sentence with an `end-of-sentence` token.
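The relationship between the decoder input and the decoder output can be sketched on a toy example. The token names `<sos>` and `<eos>` and the word-level target sentence below are placeholders chosen for illustration; the tutorial itself works on characters.

```python
# Hypothetical placeholder tokens and target sentence, for illustration only.
SOS, EOS = "<sos>", "<eos>"
target_sentence = ["w", "x", "y", "z"]

# Decoder input: the target sentence prefixed with the start-of-sentence token.
decoder_input = [SOS] + target_sentence
# Decoder target: the same sentence followed by the end-of-sentence token.
decoder_target = target_sentence + [EOS]

# At every time step t the decoder reads decoder_input[t]
# and is trained to predict decoder_target[t].
for step, (inp, out) in enumerate(zip(decoder_input, decoder_target)):
    print(step, inp, "->", out)
```

The two sequences have the same length and are shifted by one position relative to each other, which is what lets the decoder learn to predict the next token from the previous one.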

## 1. Setup

In [None]:
# install used packages in the current Jupyter kernel
import sys
!{sys.executable} -m pip install keras
!{sys.executable} -m pip install tensorflow
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install matplotlib
# os is part of the Python standard library and does not need to be installed

In [None]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding
from keras.optimizers import Adam
import numpy as np
import os

# workaround for duplicate OpenMP library errors on macOS
os.environ['KMP_DUPLICATE_LIB_OK']='True'

## 2. Data preprocessing
### Dataset
The language translation model that we are going to develop will translate English sentences into their Chinese language counterparts. To develop such a model, we need a dataset that contains English sentences and their Chinese translations. This dataset is from http://www.manythings.org/anki/.

In [None]:
with open('cmn.txt', 'r', encoding='utf-8') as f:
    data = f.read()
data = data.split('\n')
num_data = 100
data = data[:num_data]
print(data[-5:])

On each line, the text file contains an English sentence and its Chinese translation, separated by a tab. Each line also contains some other information that is not related to our translation model and will be omitted.

The dataset contains 22,075 records, but we will use only the first `num_data` records (currently 100) to train our model. You can use more records if you want.
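As a small illustration of the tab-separated layout described above, the sketch below splits one line into its fields. The sample line is made up for this example, and its third field only stands in for the extra information that is discarded.

```python
# Hypothetical sample line in the tab-separated layout described above.
sample_line = "Hi.\t嗨。\textra information"

fields = sample_line.split("\t")
# Keep the first two fields and ignore everything after them.
english, chinese = fields[0], fields[1]

print(english)
print(chinese)
```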

### Data Preprocessing
In our dataset, the input sentences need no further processing. However, we need to generate two copies of each translated sentence: one with the start-of-sentence token and the other with the end-of-sentence token.
- start-of-sentence token: '\t'
- end-of-sentence token: '\n'

In [None]:
en_data = [line.split('\t')[0] for line in data]
ch_data = ['\t' + line.split('\t')[1] + '\n' for line in data]
print('English data:\n', en_data[:10])
print('\nChinese data:\n', ch_data[:10])

Since the Keras `Tokenizer` class does not handle Chinese text well, in this tutorial we build the English and Chinese dictionaries manually and represent each character with one-hot encoding. The following script:
- Generates the English and Chinese dictionaries
- Maps each character to an index

The [`enumerate()`](https://www.geeksforgeeks.org/enumerate-in-python/) function adds a counter to an iterable and returns it as an `enumerate` object. This object can be used directly in `for` loops or converted into a list of tuples with `list()`.
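A minimal sketch of how `enumerate()` builds a character-to-index dictionary is shown here on the string `"hello"`. The call to `sorted()` is an addition for this example, not part of the tutorial's script: it makes the index assignment reproducible, because iterating over a plain `set` has no guaranteed order.

```python
# sorted() is an assumption added here so the indices are the same on every run.
vocab = sorted(set("hello"))
id2char = list(vocab)
char2id = {c: i for i, c in enumerate(id2char)}

print(list(enumerate(id2char)))  # [(0, 'e'), (1, 'h'), (2, 'l'), (3, 'o')]
print(char2id)                   # {'e': 0, 'h': 1, 'l': 2, 'o': 3}
```

The two structures are inverses of each other: `id2char[char2id[c]]` returns `c` for every character in the vocabulary.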

In [None]:
# generate English dictionary
en_vocab = set(''.join(en_data))
id2en = list(en_vocab)
en2id = {c:i for i,c in enumerate(id2en)}

# generate Chinese dictionary 
ch_vocab = set(''.join(ch_data))
id2ch = list(ch_vocab)
ch2id = {c:i for i,c in enumerate(id2ch)}

print('\n English dictionary:\n', en2id)
print('\n Chinese dictionary:\n', ch2id)

## 3. Word embedding and padding
Map each data sample to its dictionary indices.
- `en_num_data` - encoder input data
- `ch_num_data` - decoder input data (with sos)
- `de_num_data` - decoder target output data (with eos)

In [None]:
en_num_data = [[en2id[en] for en in line ] for line in en_data]
ch_num_data = [[ch2id[ch] for ch in line][:-1] for line in ch_data]
de_num_data = [[ch2id[ch] for ch in line][1:] for line in ch_data]

print('char:', en_data[1])
print('index:', en_num_data[1])

In [None]:
# max encoder/decoder sequence length 
max_encoder_seq_length = max([len(txt) for txt in en_num_data])
max_decoder_seq_length = max([len(txt) for txt in ch_num_data])
print('max encoder length:', max_encoder_seq_length)
print('max decoder length:', max_decoder_seq_length)

The following script encodes the English and Chinese sentences with one-hot encoding and pads them to the same length.

In [None]:
# one-hot encoding
encoder_input_data = np.zeros((len(en_num_data), max_encoder_seq_length, len(en2id)), dtype='float32')
decoder_input_data = np.zeros((len(ch_num_data), max_decoder_seq_length, len(ch2id)), dtype='float32')
decoder_target_data = np.zeros((len(de_num_data), max_decoder_seq_length, len(ch2id)), dtype='float32')

for i in range(len(ch_num_data)):
    for t, j in enumerate(en_num_data[i]):
        encoder_input_data[i, t, j] = 1.
    for t, j in enumerate(ch_num_data[i]):
        decoder_input_data[i, t, j] = 1.
    for t, j in enumerate(de_num_data[i]):
        decoder_target_data[i, t, j] = 1.

print('index data:\n', en_num_data[1])
print('one hot data:\n', encoder_input_data[1])
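To see what one-hot encoding with zero padding produces, here is a minimal sketch on toy data. The two index sequences and the vocabulary size of 4 are made up for this example.

```python
import numpy as np

# Toy data: two index sequences of different lengths over a 4-symbol vocabulary.
sequences = [[2, 0, 3], [1]]
vocab_size = 4
max_len = max(len(seq) for seq in sequences)

# Start from all zeros so that unused time steps stay zero (the padding).
one_hot = np.zeros((len(sequences), max_len, vocab_size), dtype="float32")
for i, seq in enumerate(sequences):
    for t, idx in enumerate(seq):
        one_hot[i, t, idx] = 1.0

print(one_hot.shape)  # (2, 3, 4)
print(one_hot[1])     # first row encodes index 1, the remaining rows are padding
```

Because the array is initialized with zeros, the shorter sequence is padded implicitly: its unused time steps are simply all-zero rows.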

## 4. Create the Model
Set values for different parameters.

In [None]:
EN_VOCAB_SIZE = len(en2id)
CH_VOCAB_SIZE = len(ch2id)
HIDDEN_SIZE = 256

LEARNING_RATE = 0.003
BATCH_SIZE = 100
EPOCHS = 100

Next, we need to create the encoder and the decoder.
### Encoder
The input to the encoder will be the sentence in English and the output will be the *hidden state* and *cell state* of the LSTM.

In [None]:
encoder_inputs = Input(shape=(None, EN_VOCAB_SIZE))
encoder_lstm = LSTM(HIDDEN_SIZE, return_state=True)

encoder_outputs, state_h, state_c = encoder_lstm(encoder_inputs)

# We discard encoder_outputs and only keep the states.
encoder_states = [state_h, state_c]

[`Input()`](https://keras.io/layers/core/#Input) - used to instantiate a Keras tensor.
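- shape: A shape tuple (integers), not including the batch size. Here, `encoder_inputs` has shape `(None, EN_VOCAB_SIZE)`: a sequence of arbitrary length in which each time step is a one-hot vector of size `EN_VOCAB_SIZE`.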

`LSTM` network
- return_state: Boolean. Whether to return the last state in addition to the output.
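To make the two returned states more concrete, the following NumPy sketch runs a single-layer LSTM over a short one-hot sequence and returns the final hidden state and cell state. It is an illustration under assumptions, not the Keras implementation: the function name `lstm_final_states`, the random weights, the toy sizes, and the gate ordering in the stacked weight matrices are all chosen for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_final_states(inputs, W, U, b, hidden_size):
    """Run an LSTM over `inputs` of shape (time, features); return final h and c."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    for x_t in inputs:
        z = W @ x_t + U @ h + b            # four gates stacked: (4 * hidden_size,)
        i, f, g, o = np.split(z, 4)        # assumed order: input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                  # cell state update
        h = o * np.tanh(c)                 # hidden state update
    return h, c

rng = np.random.default_rng(0)
features, hidden_size = 5, 3               # toy sizes, not the tutorial's values
W = rng.normal(size=(4 * hidden_size, features))
U = rng.normal(size=(4 * hidden_size, hidden_size))
b = np.zeros(4 * hidden_size)

x = np.eye(features)[[0, 2, 1, 4]]         # a one-hot input sequence of length 4
state_h, state_c = lstm_final_states(x, W, U, b, hidden_size)
print(state_h.shape, state_c.shape)        # (3,) (3,)
```

Only the states after the last time step are returned, which mirrors keeping `state_h` and `state_c` and discarding the per-step outputs.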

### Decoder
The decoder has two inputs: the hidden and cell states from the encoder, and the input sentence, which is the output sentence with an `<sos>` token prepended.

In [None]:
decoder_inputs = Input(shape=(None, CH_VOCAB_SIZE))
decoder_lstm = LSTM(HIDDEN_SIZE, return_sequences=True, return_state=True)

# obtain output
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,initial_state=encoder_states)

- return_sequences: Boolean. Whether to return the output at every time step instead of only the final time step.

Finally, the output from the decoder LSTM is passed through a dense layer to predict decoder outputs, as shown below:

In [None]:
decoder_dense = Dense(CH_VOCAB_SIZE, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
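The softmax activation turns the dense layer's raw scores into a probability distribution over the target vocabulary at each time step. A minimal NumPy sketch follows, with made-up scores for 3 time steps over a 4-character vocabulary.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy scores: 3 time steps, 4 characters in the vocabulary.
logits = np.array([[2.0, 1.0, 0.1, 0.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [1.0, 3.0, 0.5, 0.2]])
probs = softmax(logits)

print(probs.sum(axis=-1))     # each row sums to 1
print(probs.argmax(axis=-1))  # index of the most likely character per step
```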

### Compile the model

In [None]:
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# recent Keras versions use `learning_rate`; the older `lr` argument is deprecated
opt = Adam(learning_rate=LEARNING_RATE, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(
    optimizer=opt,
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

[`Model`](https://keras.io/models/model/) - given some input tensor(s) and output tensor(s), you can instantiate a Model via `Model()`. This model will include all layers required in the computation of outputs given inputs.

`Adam` optimizer
- beta_1 - The exponential decay rate for the first-moment estimates (e.g. 0.9).
- beta_2 - The exponential decay rate for the second-moment estimates (e.g. 0.999). This value should be set close to 1.0 on problems with a sparse gradient (e.g. NLP and computer vision problems).
- epsilon - A very small number that prevents division by zero in the implementation (e.g. 1e-8).
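To show where the parameters above enter the computation, here is a sketch of a single Adam update for one scalar parameter, following the update rule from the original Adam paper. The function name `adam_step` and the toy objective are assumptions for this example, and the Keras implementation may differ in details.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.003, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta_1 * m + (1 - beta_1) * grad        # first moment: running mean of gradients
    v = beta_2 * v + (1 - beta_2) * grad ** 2   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta_1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta_2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v

# Toy objective f(x) = x^2 with gradient 2x, starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    x, m, v = adam_step(x, 2 * x, m, v, t)
print(x)
```

On the first step the bias-corrected ratio is close to 1 in magnitude, so the parameter moves by roughly `lr` regardless of the gradient's scale.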

In [None]:
model.summary()

## 5. Train the model

In [None]:
model.fit([encoder_input_data[0:int(num_data*0.9)], decoder_input_data[0:int(num_data*0.9)]], decoder_target_data[0:int(num_data*0.9)],
          batch_size=BATCH_SIZE,
          epochs=EPOCHS)

We can save the trained model to disk so that we can load it again at any time.

In [None]:
# Save model
# model.save('s2s.h5')

## 6. Modifying the Model for Predictions
During training, we know the actual inputs to the decoder for every position in the output sequence. Both the decoder input and the decoder output are known, and the model is trained on these input-output pairs.

During prediction, however, each character is predicted from the previous character, which was itself predicted at the previous time step. This is the purpose of the `sos` and `eos` tokens: when making actual predictions, the full output sequence is not available, since it is exactly what we have to predict. The only character available at the start is `sos`, because every output sentence starts with `sos`, and `eos` tells us when to stop.

The encoder model remains the same:

In [None]:
encoder_model = Model(encoder_inputs, encoder_states)

Since we now need the decoder hidden and cell states at each step, we modify the model to accept them as inputs, as shown below:

In [None]:
decoder_state_input_h = Input(shape=(HIDDEN_SIZE,))
decoder_state_input_c = Input(shape=(HIDDEN_SIZE,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

To make predictions, the decoder output is passed through the dense layer:

In [None]:
decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)

The final step is to define the updated decoder model, as shown here:

In [None]:
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)

## 7. Make predictions
In this step, you will see how to make predictions using English sentences as inputs.

We pass the test input sequence to the `encoder_model`, which predicts the hidden state `h` and the cell state `c`.

Next, we define a variable `target_seq`, an array of shape `(1, 1, CH_VOCAB_SIZE)` that is all zeros except for a 1 at the index of `sos`. It holds the first input character for the decoder model.

In the next line, the `outputs` list is defined, which will contain the indices of the predicted characters.

Next, we execute a while loop.
- In the first iteration, the `decoder_model` predicts the output and the hidden and cell states from the encoder's hidden and cell states and the input token, i.e. `sos`.
- The index of the predicted character is stored in `sampled_token_index` and appended to the `outputs` list.
- `target_seq` is then reset to the one-hot vector of the predicted character.
- In the next iteration, the updated hidden and cell states, together with the previously predicted character, are used to make a new prediction.
- The loop continues until the `eos` token is predicted or more than 20 characters have been generated.

In [None]:
for k in range(int(num_data*0.9),num_data):
    test_data = encoder_input_data[k:k+1]
    # Encode the input as state vectors.
    h, c = encoder_model.predict(test_data)
    target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
    target_seq[0, 0, ch2id['\t']] = 1
    outputs = []
    while True:
        output_tokens, h, c = decoder_model.predict([target_seq, h, c])
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        outputs.append(sampled_token_index)
        target_seq = np.zeros((1, 1, CH_VOCAB_SIZE))
        target_seq[0, 0, sampled_token_index] = 1
        if sampled_token_index == ch2id['\n'] or len(outputs) > 20: 
            break
    
    print(en_data[k])
    print(''.join([id2ch[i] for i in outputs]))
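The control flow of the loop above can be isolated in a small, framework-free sketch of greedy decoding. The function `toy_decoder_step` is a hypothetical stand-in for `decoder_model.predict`: it returns scores for the next token and an updated state, where the state is just a step counter. Unlike the loop above, this variant stops before appending the end-of-sentence token to the output.

```python
SOS, EOS = "\t", "\n"
MAX_LEN = 20

# Hypothetical stand-in for decoder_model.predict: maps (previous token, state)
# to (next-token scores, new state). The "state" here is only a step counter.
def toy_decoder_step(prev_token, state):
    script = ["h", "i", EOS]
    next_token = script[min(state, len(script) - 1)]
    scores = {tok: (1.0 if tok == next_token else 0.0) for tok in ["h", "i", EOS]}
    return scores, state + 1

def greedy_decode(step_fn, initial_state):
    token, state, outputs = SOS, initial_state, []
    while len(outputs) < MAX_LEN:
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)  # greedy choice: highest score
        if token == EOS:
            break
        outputs.append(token)
    return "".join(outputs)

print(greedy_decode(toy_decoder_step, 0))  # hi
```

Greedy decoding keeps only the single most likely token at each step. The length limit guards against a decoder that never emits the end-of-sentence token.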

- The code in this notebook is modified from various sources. All code is for educational purposes only and released under the CC1.0.