# Home 5: Build a seq2seq model for machine translation.

### Name: Michael DiGregorio

### Task: Translate English to Spanish

## 0. You will do the following:

1. Read and run my code.
2. Complete the code in Section 1.1 and Section 4.2.

    * Translating **English** to **German** is not acceptable! Pick another pair of languages.
    
3. **Make improvements.** Directly modify the code in Section 3. Do at least one of the two. Doing both correctly earns up to 1 bonus point toward the total.

    * Bi-LSTM instead of LSTM.
        
    * Attention. (You are allowed to use existing code.)
    
4. Evaluate the translation using the BLEU score. 

    * Optional. Up to 1 bonus point toward the total.
    
5. Convert the notebook to .HTML file. 

    * The HTML file must contain the code and the output after execution.

6. Put the .HTML file in your Google Drive, Dropbox, or Github repo. (If you submit the file to Google Drive or Dropbox, you must make the file "open-access". A delay caused by "denial of access" may result in a late penalty.)

7. Submit the link to the HTML file to Canvas.    


### Hint: 

To implement ```Bi-LSTM```, you will need the following code to build the encoder. Do NOT use Bi-LSTM for the decoder.

In [1]:
# from tensorflow.keras.layers import Bidirectional, Concatenate, LSTM

# encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True, 
#                                   dropout=0.5, name='encoder_lstm'))
# _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

# state_h = Concatenate()([forward_h, backward_h])
# state_c = Concatenate()([forward_c, backward_c])

## I am going to rewrite a lot of this because it's horribly confusing to work with

## 1. Data preparation

1. Download data (e.g., "deu-eng.zip") from http://www.manythings.org/anki/
2. Unzip the .ZIP file.
3. Put the .TXT file (e.g., "deu.txt") in the directory "./Data/".

In [11]:
# imports
import re
import numpy
import string
import tensorflow as tf
from IPython.display import SVG
from unicodedata import normalize
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import model_to_dot, plot_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, Dense, Bidirectional, Concatenate

### 1.1. Load and clean text


In [12]:
# load doc into memory
def load_doc(filename):
    # open the file read-only; the with-block closes it automatically
    with open(filename, mode='rt', encoding='utf-8') as file:
        # read all text
        return file.read()


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def clean_data(lines):
    cleaned = list()
    # prepare regex for char filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)
    for pair in lines:
        clean_pair = list()
        for line in pair:
            # normalize unicode characters
            line = normalize('NFD', line).encode('ascii', 'ignore')
            line = line.decode('UTF-8')
            # tokenize on white space
            line = line.split()
            # convert to lowercase
            line = [word.lower() for word in line]
            # remove punctuation from each token
            line = [word.translate(table) for word in line]
            # remove non-printable chars from each token
            line = [re_print.sub('', w) for w in line]
            # remove tokens with numbers in them
            line = [word for word in line if word.isalpha()]
            # store as string
            clean_pair.append(' '.join(line))
        cleaned.append(clean_pair)
    return numpy.array(cleaned)
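To see what the cleaning steps do to one sentence, here is a minimal sketch that applies the same sequence of operations as `clean_data` to a single string. The helper name `clean_sentence` is hypothetical and not part of the notebook; it only mirrors the inner loop above.

```python
import re
import string
from unicodedata import normalize


def clean_sentence(sentence):
    """Hypothetical single-sentence version of the inner loop of clean_data."""
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    table = str.maketrans('', '', string.punctuation)
    # strip accents: decompose characters (NFD), then drop everything non-ASCII
    sentence = normalize('NFD', sentence).encode('ascii', 'ignore').decode('UTF-8')
    # lowercase each whitespace-separated token and remove its punctuation
    words = [word.lower().translate(table) for word in sentence.split()]
    # remove non-printable characters from each token
    words = [re_print.sub('', word) for word in words]
    # keep only purely alphabetic tokens (drops numbers and empty strings)
    return ' '.join(word for word in words if word.isalpha())


print(clean_sentence('¡Mangos!'))
print(clean_sentence('¿Dónde está el baño?'))
```

Note that accents and the inverted punctuation marks are lost, so the model is trained on unaccented Spanish.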

#### Fill the following blanks:

In [13]:
# e.g., filename = 'Data/deu.txt'
filename = 'data/spa.txt'

# e.g., n_train = 20000
n_train = 120000

In [14]:
# load dataset
doc = load_doc(filename)
# split into Language1-Language2 pairs
pairs = to_pairs(doc)
# clean sentences
clean_pairs = clean_data(pairs)[0:n_train, :]

print(f"doc entry {doc[100]}\n")
print(f"pairs entry {pairs[100]}\n")
print(f"clean pairs entry {clean_pairs[100]}\n")

doc entry C

pairs entry ['No way!', '¡Mangos!', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2175 (CS) & #3843189 (cueyayotl)']

clean pairs entry ['no way' 'mangos' 'ccby france attribution tatoebaorg cs cueyayotl']



In [15]:
for i in range(3000, 3010):
    print('[' + clean_pairs[i, 0] + '] => [' + clean_pairs[i, 1] + ']')

[fix the roof] => [arregla el tejado]
[get the book] => [trae el libro]
[get the book] => [consigue el libro]
[get the book] => [recoge el libro]
[get the book] => [traiga el libro]
[get the book] => [recoja el libro]
[get upstairs] => [anda para arriba]
[ghosts exist] => [los fantasmas existen]
[give me half] => [dame la mitad]
[give me half] => [deme la mitad]


In [16]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(numpy.shape(target_texts)))

Length of input_texts:  (120000,)
Length of target_texts: (120000,)
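Each target sentence is wrapped in `'\t'` (start) and `'\n'` (stop) because the decoder is trained to predict the next character from the current one, with labels that are a left shift of the target sequence. The sketch below illustrates that pairing on a made-up sentence; the word `hola` is only an example and does not come from the dataset output above.

```python
# hypothetical target sentence, wrapped the same way as target_texts
target = '\t' + 'hola' + '\n'

# at step k the decoder reads target[k] and should predict target[k + 1]
steps = list(zip(target[:-1], target[1:]))

for read_char, predict_char in steps:
    print(repr(read_char), '->', repr(predict_char))
```

The first step reads the start character and the last step predicts the stop character, which is what lets decoding begin from `'\t'` and end at `'\n'`.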


In [17]:
# max encoder seq length is the longest line of the input sentences
max_encoder_seq_length = max(len(line) for line in input_texts)
# max decoder seq length is the longest line of the translated target sentences
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 52
max length of target sentences: 102


**Remark:** At this point, you have two lists of sentences: `input_texts` and `target_texts`.

## 2. Text processing

### 2.1. Convert texts to sequences

- Input: A list of $n$ sentences (with max length $t$).
- It is represented by an $n\times t$ matrix after tokenization and zero-padding.
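As a rough illustration of how sentences become an $n\times t$ matrix, the sketch below does character-level indexing, zero-padding, and one-hot encoding with plain NumPy. The helper names `build_char_index` and `to_padded_matrix` are hypothetical. Unlike the Keras `Tokenizer` used in the notebook, this sketch numbers characters alphabetically, so the actual index values will differ.

```python
import numpy


def build_char_index(sentences):
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted(set(''.join(sentences)))
    return {char: i + 1 for i, char in enumerate(chars)}


def to_padded_matrix(sentences, char_index, max_len):
    """Return an (n, max_len) integer matrix, zero-padded on the right."""
    matrix = numpy.zeros((len(sentences), max_len), dtype=int)
    for row, sentence in enumerate(sentences):
        ids = [char_index[char] for char in sentence[:max_len]]
        matrix[row, :len(ids)] = ids
    return matrix


sentences = ['no way', 'hi']
char_index = build_char_index(sentences)
t = max(len(s) for s in sentences)

matrix = to_padded_matrix(sentences, char_index, t)
print(matrix)

# one-hot encode: row i of the identity matrix is the one-hot vector for index i
one_hot = numpy.eye(len(char_index) + 1)[matrix]
print(one_hot.shape)
```

The one-hot tensor has shape (n, t, vocabulary size + 1), where the extra slot is the padding index 0.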

In [40]:
class Translator(Model):
    def __init__(self, latent_dim: int, epochs: int):
        super(Translator, self).__init__()
        
        self.latent_dim = latent_dim
        self.epochs = epochs
        
    
    def call(self, english_sentences):
        # one-hot encode every input sentence: (n, max_encoder_seq_length, num_encoder_tokens)
        input_seqs, _ = self.generate_input_sequences(english_sentences)

        decoded_sentences = []
        for i in range(len(input_seqs)):
            # encode one sentence at a time so the states match the batch size of target_seq
            states_value = list(self.encoder_model.predict(input_seqs[i:i + 1]))

            # start decoding from the start-of-sequence character '\t'
            target_seq = numpy.zeros((1, 1, self.num_decoder_tokens))
            target_seq[0, 0, self.target_token_index['\t']] = 1.

            stop_condition = False
            decoded_sentence = ''
            while not stop_condition:
                output_tokens, h, c = self.decoder_model.predict([target_seq] + states_value)

                # this line of code is greedy selection
                # try to use multinomial sampling instead (with temperature)
                sampled_token_index = int(numpy.argmax(output_tokens[0, -1, :]))

                # index 0 is the padding index and has no character; treat it as end of sentence
                sampled_char = self.reverse_target_char_index.get(sampled_token_index, '\n')
                decoded_sentence += sampled_char

                if (sampled_char == '\n' or
                   len(decoded_sentence) > self.max_decoder_seq_length):
                    stop_condition = True

                target_seq = numpy.zeros((1, 1, self.num_decoder_tokens))
                target_seq[0, 0, sampled_token_index] = 1.

                states_value = [h, c]

            decoded_sentences.append(decoded_sentence)

        return decoded_sentences
    
    def generate_input_sequences(self, sentences):
        seqs = self.encoder_tokenizer.texts_to_sequences(sentences)
        encoder_input_seq = pad_sequences(seqs, maxlen=self.max_encoder_seq_length, padding='post')
        encoder_input_data = self.onehot_encode(encoder_input_seq, self.max_encoder_seq_length, self.num_encoder_tokens)
        return encoder_input_data, self.encoder_tokenizer.word_index

    
    def fit(self, input_text, target_text):
        # generate sequence constraints
        # max encoder seq length is the longest line of the input sentences
        self.max_encoder_seq_length = max(len(line) for line in input_text)
        # max decoder seq length is the longest line of the translated target sentences
        self.max_decoder_seq_length = max(len(line) for line in target_text)
        
        # process the input text
        self.encoder_tokenizer, encoder_input_seq, input_token_index = self.text2sequences(self.max_encoder_seq_length, 
                                                      input_text)
        self.decoder_tokenizer, decoder_input_seq, target_token_index = self.text2sequences(self.max_decoder_seq_length, 
                                                       target_text)
        self.target_token_index = target_token_index
        print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
        print('shape of input_token_index: ' + str(len(input_token_index)))
        print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
        print('shape of target_token_index: ' + str(len(target_token_index)))
        self.num_encoder_tokens = len(input_token_index) + 1
        self.num_decoder_tokens = len(target_token_index) + 1

        print('num_encoder_tokens: ' + str(self.num_encoder_tokens))
        print('num_decoder_tokens: ' + str(self.num_decoder_tokens))
        print(target_text[100])
        print(decoder_input_seq[100, :])
        
        encoder_input_data = self.onehot_encode(encoder_input_seq, 
                                                self.max_encoder_seq_length, 
                                                self.num_encoder_tokens)
        
        decoder_input_data = self.onehot_encode(decoder_input_seq, 
                                                self.max_decoder_seq_length, 
                                                self.num_decoder_tokens)

        decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
        decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
        
        decoder_target_data = self.onehot_encode(decoder_target_seq, 
                                        self.max_decoder_seq_length, 
                                        self.num_decoder_tokens)
        
        print(encoder_input_data.shape)
        print(decoder_input_data.shape)
        
        # Reverse-lookup token index to decode sequences back to something readable.
        self.reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
        self.reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())
        
        self.encoder_model, self.decoder_model, self.model = self.generate_models()

        self.plot_model(self.encoder_model, 'encoder2.pdf')
        self.plot_model(self.decoder_model, 'decoder2.pdf')
        self.plot_model(self.model, 'model2.pdf')
        
        self.model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

        self.model.fit([encoder_input_data, decoder_input_data],  # training data
                  decoder_target_data,                       # labels (left shift of the target sequences)
                  batch_size=64, epochs=self.epochs, validation_split=0.2)
    
    def generate_models(self):
        encoder_inputs = Input(shape=(None, self.num_encoder_tokens),
                       name='encoder_inputs')
        
        encoder_bilstm = Bidirectional(LSTM(self.latent_dim, return_state=True, 
                                          dropout=0.5, name='encoder_lstm'))
        _, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

        state_h = Concatenate()([forward_h, backward_h])
        state_c = Concatenate()([forward_c, backward_c])
        print(state_h.shape)
        print(state_c.shape)
        encoder_model = Model(inputs=encoder_inputs, 
                              outputs=[state_h, state_c],
                              name='encoder')
        
        # inputs of the decoder network
        decoder_input_h = Input(shape=(2*self.latent_dim,), name='decoder_input_h')
        decoder_input_c = Input(shape=(2*self.latent_dim,), name='decoder_input_c')
        decoder_input_x = Input(shape=(None, self.num_decoder_tokens), name='decoder_input_x')

        # set the LSTM layer
        decoder_lstm = LSTM(2*self.latent_dim, return_sequences=True, 
                            return_state=True, dropout=0.5, name='decoder_lstm')
        decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x, 
                                                              initial_state=[decoder_input_h, decoder_input_c])

        # set the dense layer
        decoder_dense = Dense(self.num_decoder_tokens, activation='softmax', name='decoder_dense')
        decoder_outputs = decoder_dense(decoder_lstm_outputs)

        # build the decoder network model
        decoder_model= Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                             outputs=[decoder_outputs, state_h, state_c],
                             name='decoder')

        # input layers
        encoder_input_x = Input(shape=(None, self.num_encoder_tokens), name='encoder_input_x')
        decoder_input_x = Input(shape=(None, self.num_decoder_tokens), name='decoder_input_x')

        # connect encoder to decoder
        encoder_final_states = encoder_model([encoder_input_x])
        decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
        decoder_pred = decoder_dense(decoder_lstm_output)

        model = Model(inputs=[encoder_input_x, decoder_input_x], 
                      outputs=decoder_pred, 
                      name='model_training')
        
        return encoder_model, decoder_model, model
    
    
    @staticmethod
    def plot_model(model, outfile):
        SVG(model_to_dot(model, show_shapes=False).create(prog='dot', format='svg'))

        plot_model(
            model=model, show_shapes=False,
            to_file=outfile
        )

        model.summary()
        
    @staticmethod
    def text2sequences(max_len, lines):
        tokenizer = Tokenizer(char_level=True, filters='')
        tokenizer.fit_on_texts(lines)
        seqs = tokenizer.texts_to_sequences(lines)
        seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')
        return tokenizer, seqs_pad, tokenizer.word_index
    
    @staticmethod
    def onehot_encode(sequences, max_len, vocab_size):
        n = len(sequences)
        data = numpy.zeros((n, max_len, vocab_size))
        for i in range(n):
            data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
        return data
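The comment in `call` suggests replacing the greedy `argmax` with multinomial sampling at a temperature. A minimal sketch of that idea follows, assuming `probs` is one softmax output row such as `output_tokens[0, -1, :]`. The helper name `sample_with_temperature` and its `rng` parameter are hypothetical and not part of the notebook.

```python
import numpy


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from a probability vector rescaled by a temperature."""
    rng = numpy.random.default_rng() if rng is None else rng
    # work in log space; the small constant avoids log(0)
    logits = numpy.log(numpy.asarray(probs, dtype=float) + 1e-12) / temperature
    # softmax, subtracting the max for numerical stability
    scaled = numpy.exp(logits - logits.max())
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


# hypothetical softmax output over three characters
probs = [0.1, 0.7, 0.2]
print(sample_with_temperature(probs, temperature=0.5))
```

A temperature below 1 sharpens the distribution toward the most likely character (approaching greedy selection), while a temperature above 1 flattens it and gives more varied output.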

In [None]:
translator = Translator(latent_dim=256, epochs=30)
translator.fit(input_texts, target_texts)

shape of encoder_input_seq: (120000, 52)
shape of input_token_index: 27
shape of decoder_input_seq: (120000, 102)
shape of target_token_index: 29
num_encoder_tokens: 28
num_decoder_tokens: 30
	mangos

[13 15  3  6 22  4  5 14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0]
(120000, 52, 28)
(120000, 102, 30)
(None, 512)
(None, 512)
Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     [(None, None, 28)]   0                                            
__________________________________________________________________________________________________
bidirectional_1

In [None]:
print(translator(["I love you", 
                  "how are you", 
                  "I am good at cooking", 
                  "good morning", 
                  "the sky is blue", 
                  "I am very tall"]))

## 5. Evaluate the translation using BLEU score

Reference: 
- https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
- https://en.wikipedia.org/wiki/BLEU


**Hint:** 

- Randomly partition the dataset into training, validation, and test sets.

- Evaluate the BLEU score using the test set. Report the average.

- A reasonable BLEU score should be 0.1 ~ 0.5.
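To make the metric concrete, here is a simplified, self-contained BLEU sketch: clipped n-gram precision with uniform weights, a brevity penalty, a single reference per sentence, and no smoothing. The function names, the `max_n=2` default (standard BLEU uses up to 4-grams), and the example sentence pairs are assumptions for illustration. For real evaluation a library implementation such as `sentence_bleu` from `nltk.translate.bleu_score` is the more usual choice.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=2):
    """Simplified single-reference BLEU with uniform weights up to max_n-grams."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # clipped matches: a candidate n-gram counts at most as often as in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision gives a zero score
        log_precision_sum += math.log(matches / total) / max_n
    # brevity penalty: punish candidates that are shorter than the reference
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum)


# hypothetical (reference, model output) pairs standing in for a test set
test_pairs = [('deme la mitad', 'dame la mitad'),
              ('los fantasmas existen', 'los fantasmas existen')]
scores = [simple_bleu(ref, cand) for ref, cand in test_pairs]
print('average BLEU:', sum(scores) / len(scores))
```

Because the dataset contains several valid Spanish translations for the same English sentence, a single-reference score like this one can understate quality. BLEU also allows multiple references per sentence, which this sketch does not implement.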