# **Building a seq2seq model for machine translation.**

## **Language Choice and Details**

I initially wanted to build a model to translate English to **Tamil** which happens to be my Native Language so that I could work with the translations easily. But due to less number of available sentence pairs (207) on the https://www.manythings.org/anki/ website for this language, I picked **Hebrew**.

Though Hebrew has been a language of fascination for me for a while now, there are a few other important reasons I picked Hebrew for learning machine translation.

- Hebrew is **written from right to left**, it will be interesting to see if Bi-LSTM and Attention produces better results.

- English and Hebrew are from completely different language-families and roots. (https://webspace.ship.edu/cgboer/languagefamilies.html)
They are from completely different regions and time-periods and have isolated places of origins.
  - English -> **Region:** West Germany, **Language family:** Indo-European, **Root:** Germanic
  - Hebrew -> **Region:** Israel, **Language family:** Afro-Asiatic, **Root:** Semitic

- They share almost a near-zero lexical similarity (https://en.wikipedia.org/wiki/Lexical_similarity) and less genetic relationship and linguistic interference. (https://en.wikipedia.org/wiki/Genetic_relationship_(linguistics))

- English-Hebrew also has a good number of sentence pairs available on the https://www.manythings.org/anki/ website(127856)



## Data preparation

heb.txt under home directory

### Load and clean text


#### **Since canonical normalization (normalize-NFD) does not work with Hebrew characters, I'm using an open-source library for tokenizing Hebrew text (https://github.com/YontiLevin/Hebrew-Tokenizer)**

In [86]:
pip install hebrew_tokenizer



In [88]:
# hebrew tokenizer by github user 'YontiLevin'
import hebrew_tokenizer as ht

In [89]:
import re
import string
import numpy as np
from unicodedata import normalize

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, mode='rt', encoding='utf-8')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text


# split a loaded document into sentences
def to_pairs(doc):
    lines = doc.strip().split('\n')
    pairs = [line.split('\t') for line in  lines]
    return pairs

def is_hebrew(text):
    # Check if the text contains Hebrew characters
    return any('\u0590' <= char <= '\u05EA' for char in text)

def clean_data(lines):
    cleaned = []  # Initialize an empty list for storing cleaned data

    # Prepare regex for character filtering
    re_print = re.compile('[^%s]' % re.escape(string.printable))
    # Prepare translation table for removing punctuation
    table = str.maketrans('', '', string.punctuation)

    for pair in lines:
        clean_pair = []

        for line in pair:
            # separate processing for English and Hebrew
            if is_hebrew(line):
                clean_tokens = []
                # Tokenize the Hebrew line
                tokens = ht.tokenize(line)

                for grp, token, token_num, (start_index, end_index) in tokens:
                  clean_tokens.append(('{}'.format(token)))

                # Convert to lowercase
                tokens = [word.lower() for word in tokens]

                # Remove punctuation from each word
                clean_tokens = [remove_punctuation_from_word(word, table) for word in clean_tokens]

                # Remove non-printable characters from each token
                #clean_tokens = [re_print.sub('', w) for w in tokens]

                # Remove tokens with numbers in them
                clean_tokens = [word for word in clean_tokens if word.isalpha()]
            else:
                # Normalize Unicode characters
                line = normalize('NFD', line).encode('ascii', 'ignore')
                line = line.decode('UTF-8')

                # Tokenize on whitespace
                tokens = line.split()

                # Convert to lowercase
                tokens = [word.lower() for word in tokens]

                # Remove punctuation from each word
                clean_tokens = [remove_punctuation_from_word(word, table) for word in tokens]

                # Remove non-printable characters from each token
                clean_tokens = [re_print.sub('', w) for w in clean_tokens]

                # Remove tokens with numbers in them
                clean_tokens = [word for word in clean_tokens if word.isalpha()]

            # Store as a string
            clean_line = ' '.join(clean_tokens)
            clean_pair.append(clean_line)

        cleaned.append(clean_pair)

    return np.array(cleaned)

def remove_punctuation_from_word(word, table):
    return word.translate(table)


In [90]:
# filename
filename = 'heb.txt'

#number of sentences for training
#n_train = 100000

#### **After cleaning I randomly split the data to 80:20 for train and test** (validation data is 20% within training set and is used for tuning the model)
**Optional:** Ignore last few sentence pairs for faster training as they are huge and model requires enormous resources and a lot of time to train after OHE

In [96]:
# load dataset
doc = load_doc(filename)

# split into Language1-Language2 pairs
pairs = to_pairs(doc)

# Clean sentences
clean= clean_data(pairs)

# optional - ignore high length sentences (if RAM is low)
clean = clean[0: 35000]

# Shuffle the pairs randomly
np.random.shuffle(clean)

# Calculate the split index for 80% training and 20% test
split_index = int(0.8 * len(clean))

# Split the pairs into training and test sets
clean_pairs = clean[:split_index]
clean_pairs_test = clean[split_index:]

# Print the number of pairs in each set
print("Number of training pairs:", len(clean_pairs))
print("Number of test pairs:", len(clean_pairs_test))

Number of training pairs: 28000
Number of test pairs: 7000


In [97]:
# print sample - not cleaned
print(pairs[5000])

["They're fine.", 'הם בסדר.', 'CC-BY 2.0 (France) Attribution: tatoeba.org #2111291 (CK) & #5426571 (fekundulo)']


#### Training set

In [98]:
# print samples after cleaning - training
print(clean_pairs[5000])

['i have to dress up' 'אני צריכה להתלבש'
 'ccby france attribution tatoebaorg ck nava']


In [99]:
for i in range(5000, 5010):
    print('[' + clean_pairs[i][0] + '] => [' + clean_pairs[i][1] + ']')

[i have to dress up] => [אני צריכה להתלבש]
[he appeared young] => [הוא נראה צעיר]
[its a catchy song] => [זה שיר קליט]
[what dont you have] => [מה אין לך]
[look at the sky] => [תסתכלי על השמיים]
[i did it for you] => [עשיתי את זה בשבילך]
[could that change] => [זה יכול להשתנות]
[you cheated] => [רמית]
[ill be in the car] => [אהיה במכונית]
[long time no see] => [מזמן לא התראינו]


In [100]:
input_texts = clean_pairs[:, 0]
target_texts = ['\t' + text + '\n' for text in clean_pairs[:, 1]]

print('Length of input_texts:  ' + str(input_texts.shape))
print('Length of target_texts: ' + str(input_texts.shape))

Length of input_texts:  (28000,)
Length of target_texts: (28000,)


In [101]:
print(input_texts[5000])
print(target_texts[5000])

i have to dress up
	אני צריכה להתלבש



In [102]:
# highest length of inputs and outputs
max_encoder_seq_length = max(len(line) for line in input_texts)
max_decoder_seq_length = max(len(line) for line in target_texts)

print('max length of input  sentences: %d' % (max_encoder_seq_length))
print('max length of target sentences: %d' % (max_decoder_seq_length))

max length of input  sentences: 20
max length of target sentences: 44


#### Test set

In [103]:
# print samples after cleaning - test
print(clean_pairs_test[5000])

['tom always lies' 'תום תמיד משקר'
 'ccby france attribution tatoebaorg joseph fekundulo']


In [104]:
for i in range(5000, 5010):
    print('[' + clean_pairs_test[i][0] + '] => [' + clean_pairs_test[i][1] + ']')

[tom always lies] => [תום תמיד משקר]
[can you go get it] => [אתה יכול להביא את זה]
[i like them] => [אני אוהב אותם]
[i never worry] => [לעולם אינני דואג]
[were not ready] => [אנו לא מוכנות]
[he and i are cousins] => [הוא ואני בני דודים]
[you are so stupid] => [אתה כזה טמבל]
[does tom know why] => [טום יודע למה]
[they got nothing] => [הם לא קיבלו דבר]
[i didnt order it] => [לא הזמנתי את זה]


In [105]:
test_input_texts = clean_pairs_test[:, 0]
test_target_texts = ['\t' + text + '\n' for text in clean_pairs_test[:, 1]]

print('Length of input_texts:  ' + str(test_input_texts.shape))
print('Length of target_texts: ' + str(test_input_texts.shape))

Length of input_texts:  (7000,)
Length of target_texts: (7000,)


## Text processing

### Convert texts to sequences

##### **Strategy:** If language is Hebrew, I tried adding padding to the left instead (since Hebrew writing is from right to left) - Used in model C (results were not good)

##### **Strategy** : Reverse the token sequence for right to left learning in Hebrew

In [108]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences


# encode and pad sequences

def text2sequences_heb(max_len, lines, lang):

    tokenizer = Tokenizer(char_level=True,filters='')
    tokenizer.fit_on_texts(lines)
    seqs = tokenizer.texts_to_sequences(lines)

    if lang == "Heb":
      seqs.reverse()
    #    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='pre')
    #else:
    #    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')

    seqs_pad = pad_sequences(seqs, maxlen=max_len, padding='post')

    return seqs_pad, tokenizer.word_index

#### text->seq : Training set

In [109]:
encoder_input_seq, input_token_index = text2sequences_heb(max_encoder_seq_length,
                                                      input_texts, lang="Eng")
decoder_input_seq, target_token_index = text2sequences_heb(max_decoder_seq_length,
                                                       target_texts, lang="Heb")


print('shape of encoder_input_seq: ' + str(encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(input_token_index)))

print('shape of decoder_input_seq: ' + str(decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(target_token_index)))

shape of encoder_input_seq: (28000, 20)
shape of input_token_index: 27
shape of decoder_input_seq: (28000, 44)
shape of target_token_index: 46


In [110]:
print(encoder_input_seq[5000])

[ 5  1 10  6 23  2  1  3  4  1 12  9  2  7  7  1 14 19  0  0]


In [111]:
print(input_token_index)


{' ': 1, 'e': 2, 't': 3, 'o': 4, 'i': 5, 'a': 6, 's': 7, 'n': 8, 'r': 9, 'h': 10, 'l': 11, 'd': 12, 'm': 13, 'u': 14, 'y': 15, 'w': 16, 'c': 17, 'g': 18, 'p': 19, 'k': 20, 'b': 21, 'f': 22, 'v': 23, 'j': 24, 'x': 25, 'q': 26, 'z': 27}


In [112]:
print(target_token_index)

{' ': 1, 'י': 2, 'ו': 3, 'ה': 4, '\t': 5, '\n': 6, 'א': 7, 'ת': 8, 'ל': 9, 'נ': 10, 'ר': 11, 'מ': 12, 'ם': 13, 'ב': 14, 'ש': 15, 'ע': 16, 'כ': 17, 'ח': 18, 'ז': 19, 'ד': 20, 'ק': 21, 'פ': 22, 'ס': 23, 'צ': 24, 'ג': 25, 'ט': 26, 'ך': 27, 'ן': 28, 'ף': 29, 'ץ': 30, 'i': 31, 'b': 32, 'f': 33, 'a': 34, 'r': 35, 'm': 36, 'w': 37, 'c': 38, 'n': 39, 'd': 40, 'v': 41, 'l': 42, 'o': 43, 'y': 44, 'k': 45, 'e': 46}


In [113]:
num_encoder_tokens = len(input_token_index) + 1
num_decoder_tokens = len(target_token_index) + 1

print('num_encoder_tokens: ' + str(num_encoder_tokens))
print('num_decoder_tokens: ' + str(num_decoder_tokens))

num_encoder_tokens: 28
num_decoder_tokens: 47


In [114]:
input_texts[5000]

'i have to dress up'

In [115]:
target_texts[5000]

'\tאני צריכה להתלבש\n'

In [116]:
decoder_input_seq[5000, :]

array([ 5,  7,  9,  1,  8, 17,  7,  2, 14,  2,  1,  9,  2,  1, 14, 14, 21,
       15,  4,  6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0], dtype=int32)

#### text->seq : Test set

In [117]:
test_encoder_input_seq, test_input_token_index = text2sequences_heb(max_encoder_seq_length,
                                                      test_input_texts, lang="Eng")
test_decoder_input_seq, test_target_token_index = text2sequences_heb(max_decoder_seq_length,
                                                       test_target_texts, lang="Heb")


print('shape of encoder_input_seq: ' + str(test_encoder_input_seq.shape))
print('shape of input_token_index: ' + str(len(test_input_token_index)))

print('shape of decoder_input_seq: ' + str(test_decoder_input_seq.shape))
print('shape of target_token_index: ' + str(len(test_target_token_index)))

shape of encoder_input_seq: (7000, 20)
shape of input_token_index: 27
shape of decoder_input_seq: (7000, 44)
shape of target_token_index: 32


### One-hot encode
(Replace with generator)


In [118]:
from keras.utils import to_categorical
import numpy
# one hot encode target sequence
def onehot_encode(sequences, max_len, vocab_size):
    n = len(sequences)
    data = numpy.zeros((n, max_len, vocab_size))
    for i in range(n):
        data[i, :, :] = to_categorical(sequences[i], num_classes=vocab_size)
    return data

#### OHE - Training set

In [119]:
encoder_input_data = onehot_encode(encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
decoder_input_data = onehot_encode(decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

decoder_target_seq = numpy.zeros(decoder_input_seq.shape)
decoder_target_seq[:, 0:-1] = decoder_input_seq[:, 1:]
decoder_target_data = onehot_encode(decoder_target_seq,
                                    max_decoder_seq_length,
                                    num_decoder_tokens)

print(encoder_input_data.shape)
print(decoder_input_data.shape)

(28000, 20, 28)
(28000, 44, 47)


#### OHE - Test set

In [120]:
test_encoder_input_data = onehot_encode(test_encoder_input_seq, max_encoder_seq_length, num_encoder_tokens)
test_decoder_input_data = onehot_encode(test_decoder_input_seq, max_decoder_seq_length, num_decoder_tokens)

test_decoder_target_seq = numpy.zeros(test_decoder_input_seq.shape)
test_decoder_target_seq[:, 0:-1] = test_decoder_input_seq[:, 1:]
test_decoder_target_data = onehot_encode(test_decoder_target_seq,
                                    max_decoder_seq_length,
                                    num_decoder_tokens)

print(test_encoder_input_data.shape)
print(test_decoder_input_data.shape)

(7000, 20, 28)
(7000, 44, 47)


## Building Networks, Training and Prediction

### a) LSTM Seq2Seq model

Combinations tried:

- Latent dimensions - 128/256/512
- training epochs - 10, 25, 50
- dropout = 0.2, 0.5
- activation function - RMS, Adam
- sampling - greedy, multinomial (different temperatures)

#### a1) Encoder network

In [130]:
from keras.layers import Input, LSTM
from keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

# set the LSTM layer
encoder_lstm = LSTM(latent_dim, return_state=True,
                    dropout=0.5, name='encoder_lstm')
_, state_h, state_c = encoder_lstm(encoder_inputs)

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs,
                      outputs=[state_h, state_c],
                      name='encoder')

In [131]:
encoder_model.summary()

Model: "encoder"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_inputs (InputLayer  [(None, None, 28)]        0         
 )                                                               
                                                                 
 encoder_lstm (LSTM)         [(None, 256),             291840    
                              (None, 256),                       
                              (None, 256)]                       
                                                                 
Total params: 291840 (1.11 MB)
Trainable params: 291840 (1.11 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


#### a2) Decoder network

In [132]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim, return_sequences=True,
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x,
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

In [133]:
decoder_model.summary()


Model: "decoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_h (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_c (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                         

#### a3) Connect the encoder and decoder

In [134]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x],
              outputs=decoder_pred,
              name='model_training')

In [135]:
print(state_h)
print(decoder_input_h)

KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.float32, name=None), name='decoder_lstm/PartitionedCall:2', description="created by layer 'decoder_lstm'")
KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.float32, name='decoder_input_h'), name='decoder_input_h', description="created by layer 'decoder_input_h'")


In [136]:
model.summary()

Model: "model_training"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_input_x (InputLaye  [(None, None, 28)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 encoder (Functional)        [(None, 256),                291840    ['encoder_input_x[0][0]']     
                              (None, 256)]                                           

#### a4) Fit the model on the bilingual dataset

In [137]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(28000, 20, 28)
shape of decoder_input_data(28000, 44, 47)
shape of decoder_target_data(28000, 44, 47)


In [138]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=100, validation_split=0.2)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x786ce63f6a70>

In [139]:
model.save('seq2seq.keras')

#### a5) Make predictions

In [140]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [141]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=3)
        # greedy selection
        # sampled_token_index = numpy.argmax(output_tokens[0, -1, :])

        # multinomial sampling with temperature
        temperature=0.2
        pred = output_tokens[0, -1, :] ** (1.0 / temperature)
        temp_pred = pred / np.sum(pred)
        sampled_token_index = np.random.choice(len(output_tokens[0, -1, :]), p=temp_pred)

        sampled_char = reverse_target_char_index[sampled_token_index+1]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]
        decoded_sentence = decoded_sentence[::-1]
        #decoded_sentence = decoded_sentence.strip()
    return decoded_sentence


In [142]:
for seq_index in range(5100, 5102):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    print('-')
    print('English:       ', test_input_texts[seq_index])
    print('Your target language (true): ', target_texts[seq_index][1:-1])
    print('Your target language (pred): ', decoded_sentence[0:-1])


-
English:        enjoy the game
Your target language (true):  אתה יכול לטלפן לו
Your target language (pred):  אחם		בלהיויה
-
English:        ill do anything
Your target language (true):  אל תחששי לי
Your target language (pred):  ןע	ו	ניטםותריהככיעמינ


#### a6) Translate an English sentence to the target language

In [143]:
input_sentence = 'I love you'

input_tokens = [char for char in input_sentence]

input_sequence = text2sequences_heb(max_encoder_seq_length, [input_tokens], lang="Eng")[0]

input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)

translated_sentence = decode_sequence(input_x)
translated_sentence = translated_sentence.strip()

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)


source sentence is: I love you
translated sentence is: קםמוננז	ת	הילוי	היתא


#### a7) Evaluate the translation using BLEU score (5 samples)

In [144]:
from nltk.translate.bleu_score import sentence_bleu

individual_bleu_scores = []

for seq_index in range(5000, 5005):

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    actual_sentence = test_target_texts[seq_index][1:-1]
    print("Decoded sentence: ", decoded_sentence)
    print("Actual sentence: ", actual_sentence)
    candidate = decoded_sentence.split()
    references = actual_sentence.split()
    # Calculate BLEU score for this sentence
    bleu_score = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
    print('Cumulative 1-gram: %f' % sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))
    print('Cumulative 2-gram: %f' % sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))
    print('Cumulative 3-gram: %f' % sentence_bleu(references, candidate, weights=(0.33, 0.33, 0.33, 0)))
    print('Cumulative 4-gram: %f' % sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
    individual_bleu_scores.append(bleu_score)

# Calculate the average BLEU score
average_bleu_score = sum(individual_bleu_scores) / len(individual_bleu_scores)

print("Average BLEU Score on the Test Set:", average_bleu_score)


Decoded sentence:  יתלענמעיטםותריהככהיכהילדא
Actual sentence:  תום תמיד משקר
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  קז	יחםותריהמנלוא
Actual sentence:  אתה יכול להביא את זה
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אלתלשםת	ויהמיהו
Actual sentence:  אני אוהב אותם
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  מז	יחםותריהמנלוא
Actual sentence:  לעולם אינני דואג
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אהם		ת	היויתמ
Actual sentence:  אנו לא מוכנות
Cumulative 1-gram: 0.333333
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Average BLEU Score on the Test Set: 0.06666666666666667


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


### b) Bi-LSTM Seq2Seq model

Combinations tried:

- Latent dimensions - 128/256/512
- training epochs - 10, 25, 50
- dropout = 0.2, 0.5
- activation function - RMS, Adam
- sampling - greedy, multinomial (different temperatures)

#### b1) Encoder network

In [145]:
from keras.layers import Bidirectional, Concatenate
from keras.layers import Input, LSTM
from keras.models import Model

latent_dim = 128

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True,
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

encoder_model = Model(inputs=encoder_inputs,
                      outputs=[state_h, state_c],
                      name='encoder')

In [146]:
encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None, 28)]           0         []                            
 )                                                                                                
                                                                                                  
 bidirectional_1 (Bidirecti  [(None, 256),                160768    ['encoder_inputs[0][0]']      
 onal)                        (None, 128),                                                        
                              (None, 128),                                                        
                              (None, 128),                                                        
                              (None, 128)]                                                  

#### b2) Decoder network

In [147]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim*2, return_sequences=True,
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x,
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

In [148]:
decoder_model.summary()


Model: "decoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_h (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_c (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                         

#### b3) Connect the encoder and decoder

In [149]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x],
              outputs=decoder_pred,
              name='model_training')

In [150]:
print(state_h)
print(decoder_input_h)

KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.float32, name=None), name='decoder_lstm/PartitionedCall:2', description="created by layer 'decoder_lstm'")
KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.float32, name='decoder_input_h'), name='decoder_input_h', description="created by layer 'decoder_input_h'")


In [151]:
model.summary()

Model: "model_training"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_input_x (InputLaye  [(None, None, 28)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 encoder (Functional)        [(None, 256),                160768    ['encoder_input_x[0][0]']     
                              (None, 256)]                                           

#### b4) Fit the model on the bilingual dataset

In [152]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(28000, 20, 28)
shape of decoder_input_data(28000, 44, 47)
shape of decoder_target_data(28000, 44, 47)


In [153]:
model.compile(optimizer='adam', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=100, validation_split=0.2)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

<keras.src.callbacks.History at 0x786ce6d9ac80>

In [None]:
model.save('seq2se_bilstm.keras')

#### b5) Make predictions

In [154]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [155]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=3)
        # greedy selection
        # sampled_token_index = numpy.argmax(output_tokens[0, -1, :])

        # multinomial sampling with temperature
        temperature=0.2
        pred = output_tokens[0, -1, :] ** (1.0 / temperature)
        temp_pred = pred / np.sum(pred)
        sampled_token_index = np.random.choice(len(output_tokens[0, -1, :]), p=temp_pred)

        sampled_char = reverse_target_char_index[sampled_token_index+1]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]
        decoded_sentence = decoded_sentence[::-1]

    return decoded_sentence


In [156]:
for seq_index in range(1200, 1202):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    print('-')
    print('English:       ', test_input_texts[seq_index])
    print('Your target language (true): ', target_texts[seq_index][1:-1])
    print('Your target language (pred): ', decoded_sentence[0:-1])

-
English:        tom ate
Your target language (true):  אנו משוכנעים
Your target language (pred):  אהשעטםותריהוים
-
English:        tell the truth
Your target language (true):  זה עוזר
Your target language (pred):  נהם		בלהיוישנ


#### b6) Translate an English sentence to the target language

In [157]:
input_sentence = 'I love you'

input_tokens = [char for char in input_sentence]

input_sequence = text2sequences_heb(max_encoder_seq_length, [input_tokens], lang="Eng")[0]

input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)

translated_sentence = decode_sequence(input_x)
translated_sentence = translated_sentence.strip()

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)


source sentence is: I love you
translated sentence is: אחם		בלהיויהף


#### b7) Evaluate the translation using BLEU score (for 5 samples)


In [158]:
from nltk.translate.bleu_score import sentence_bleu

individual_bleu_scores = []

for seq_index in range(5000, 5005):

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    actual_sentence = test_target_texts[seq_index][1:-1]
    print("Decoded sentence: ", decoded_sentence)
    print("Actual sentence: ", actual_sentence)
    candidate = decoded_sentence.split()
    references = actual_sentence.split()
    # Calculate BLEU score for this sentence
    bleu_score = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
    print('Cumulative 1-gram: %f' % sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))
    print('Cumulative 2-gram: %f' % sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))
    print('Cumulative 3-gram: %f' % sentence_bleu(references, candidate, weights=(0.33, 0.33, 0.33, 0)))
    print('Cumulative 4-gram: %f' % sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
    individual_bleu_scores.append(bleu_score)

# Calculate the average BLEU score
average_bleu_score = sum(individual_bleu_scores) / len(individual_bleu_scores)

print("Average BLEU Score on the Test Set:", average_bleu_score)


Decoded sentence:  אונלטטםותרילככיםב
Actual sentence:  תום תמיד משקר
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אוחיחנזחםותריהמינלכעה
Actual sentence:  אתה יכול להביא את זה
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אהם		ת	היוישז
Actual sentence:  אני אוהב אותם
Cumulative 1-gram: 0.333333
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אשםעטםותרימויהמ
Actual sentence:  לעולם אינני דואג
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אהם		ת	היוישז
Actual sentence:  אנו לא מוכנות
Cumulative 1-gram: 0.333333
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Average BLEU Score on the Test Set: 0.13333333333333333


### c) Bi-LSTM Seq2Seq model - Hebrew pre-padding, more training epochs and higher dimensions in hidden state

Combinations used:

- Latent dimension - 512
- training epochs - 75
- dropout = 0.5
- activation function - Adam
- sampling - multinomial (temperature = 0.25)

#### c1) Encoder network

In [159]:
from keras.layers import Bidirectional, Concatenate
from keras.layers import Input, LSTM
from keras.models import Model

latent_dim = 512

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

encoder_bilstm = Bidirectional(LSTM(latent_dim, return_state=True,
                                  dropout=0.5, name='encoder_lstm'))
_, forward_h, forward_c, backward_h, backward_c = encoder_bilstm(encoder_inputs)

state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

encoder_model = Model(inputs=encoder_inputs,
                      outputs=[state_h, state_c],
                      name='encoder')

In [160]:
encoder_model.summary()

Model: "encoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_inputs (InputLayer  [(None, None, 28)]           0         []                            
 )                                                                                                
                                                                                                  
 bidirectional_2 (Bidirecti  [(None, 1024),               2215936   ['encoder_inputs[0][0]']      
 onal)                        (None, 512),                                                        
                              (None, 512),                                                        
                              (None, 512),                                                        
                              (None, 512)]                                                  

#### c2) Decoder network

In [161]:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim*2,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim*2,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# set the LSTM layer
decoder_lstm = LSTM(latent_dim*2, return_sequences=True,
                    return_state=True, dropout=0.5, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x,
                                                      initial_state=[decoder_input_h, decoder_input_c])

# set the dense layer
decoder_dense = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense')
decoder_outputs = decoder_dense(decoder_lstm_outputs)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs, state_h, state_c],
                      name='decoder')

In [162]:
decoder_model.summary()


Model: "decoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_h (InputLaye  [(None, 1024)]               0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_c (InputLaye  [(None, 1024)]               0         []                            
 r)                                                                                         

#### c3) Connect the encoder and decoder

In [163]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x],
              outputs=decoder_pred,
              name='model_training')

In [164]:
print(state_h)
print(decoder_input_h)

KerasTensor(type_spec=TensorSpec(shape=(None, 1024), dtype=tf.float32, name=None), name='decoder_lstm/PartitionedCall:2', description="created by layer 'decoder_lstm'")
KerasTensor(type_spec=TensorSpec(shape=(None, 1024), dtype=tf.float32, name='decoder_input_h'), name='decoder_input_h', description="created by layer 'decoder_input_h'")


In [165]:
model.summary()

Model: "model_training"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_input_x (InputLaye  [(None, None, 28)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 encoder (Functional)        [(None, 1024),               2215936   ['encoder_input_x[0][0]']     
                              (None, 1024)]                                          

#### c4) Fit the model on the bilingual dataset

In [166]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(28000, 20, 28)
shape of decoder_input_data(28000, 44, 47)
shape of decoder_target_data(28000, 44, 47)


In [167]:
model.compile(optimizer='adam', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=64, epochs=200, validation_split=0.2)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

<keras.src.callbacks.History at 0x786cd8c6df30>

In [None]:
model.save('seq2seq_bilstm_rev.keras')

#### c5) Make predictions

In [168]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [169]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=3)
        # greedy selection
        sampled_token_index = numpy.argmax(output_tokens[0, -1, :])

        # multinomial sampling with temperature
        #temperature=0.25
        #pred = output_tokens[0, -1, :] ** (1.0 / temperature)
        #temp_pred = pred / np.sum(pred)
        #sampled_token_index = np.random.choice(len(output_tokens[0, -1, :]), p=temp_pred)

        sampled_char = reverse_target_char_index[sampled_token_index+1]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]
        decoded_sentence = decoded_sentence[::-1]
    return decoded_sentence


In [170]:
for seq_index in range(500, 502):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    print('-')
    print('English:       ', test_input_texts[seq_index])
    print('Your target language (true): ', target_texts[seq_index][1:-1])
    print('Your target language (pred): ', decoded_sentence[0:-1])

-
English:        how are the kids
Your target language (true):  זה פוגעני
Your target language (pred):  אהם		בלהיויש
-
English:        come if you can
Your target language (true):  זהו את תום
Your target language (pred):  אהם		בלהיויש


#### c6) Translate an English sentence to the target language

In [171]:
input_sentence = 'I love you'

input_tokens = [char for char in input_sentence]

input_sequence = text2sequences_heb(max_encoder_seq_length, [input_tokens], lang="Eng")[0]

input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)

translated_sentence = decode_sequence(input_x)

translated_sentence = translated_sentence.strip()

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)


source sentence is: I love you
translated sentence is: אםם		ב	היויהכ


#### c7) Evaluate the translation using BLEU score (for 5 samples)


In [172]:
from nltk.translate.bleu_score import sentence_bleu

individual_bleu_scores = []

for seq_index in range(5000, 5005):

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    actual_sentence = test_target_texts[seq_index][1:-1]
    print("Decoded sentence: ", decoded_sentence)
    print("Actual sentence: ", actual_sentence)
    candidate = decoded_sentence.split()
    references = actual_sentence.split()
    # Calculate BLEU score for this sentence
    bleu_score = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
    print('Cumulative 1-gram: %f' % sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))
    print('Cumulative 2-gram: %f' % sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))
    print('Cumulative 3-gram: %f' % sentence_bleu(references, candidate, weights=(0.33, 0.33, 0.33, 0)))
    print('Cumulative 4-gram: %f' % sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
    individual_bleu_scores.append(bleu_score)

# Calculate the average BLEU score
average_bleu_score = sum(individual_bleu_scores) / len(individual_bleu_scores)

print("Average BLEU Score on the Test Set:", average_bleu_score)


Decoded sentence:  אהם		בלהיוישכ
Actual sentence:  תום תמיד משקר
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אהם		ב	היוישמ
Actual sentence:  אתה יכול להביא את זה
Cumulative 1-gram: 0.333333
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  שךמש		ת	ויוישיהא
Actual sentence:  אני אוהב אותם
Cumulative 1-gram: 0.333333
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אהם		בלהיויהכ
Actual sentence:  לעולם אינני דואג
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אהם		בלהיוישכ
Actual sentence:  אנו לא מוכנות
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Average BLEU Score on the Test Set: 0.13333333333333333


### d) LSTM Seq2Seq model with Attention


(Source: https://saturncloud.io/blog/add-attention-mechanism-to-an-lstm-model-in-keras/)

#### d1) Encoder network

In [173]:
from keras.layers import Input, LSTM
from keras.models import Model

latent_dim = 256

# inputs of the encoder network
encoder_inputs = Input(shape=(None, num_encoder_tokens),
                       name='encoder_inputs')

# set the LSTM layer
encoder_lstm = LSTM(latent_dim, return_state=True,
                    dropout=0.2, name='encoder_lstm')
encoder_lstm_output, state_h, state_c = encoder_lstm(encoder_inputs)

# build the encoder network model
encoder_model = Model(inputs=encoder_inputs,
                      outputs=[state_h, state_c],
                      name='encoder')

In [174]:
encoder_model.summary()

Model: "encoder"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 encoder_inputs (InputLayer  [(None, None, 28)]        0         
 )                                                               
                                                                 
 encoder_lstm (LSTM)         [(None, 256),             291840    
                              (None, 256),                       
                              (None, 256)]                       
                                                                 
Total params: 291840 (1.11 MB)
Trainable params: 291840 (1.11 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


#### d2) Decoder network with Attention

In [175]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Attention, Concatenate


# inputs of the decoder network
decoder_input_h = Input(shape=(latent_dim,), name='decoder_input_h')
decoder_input_c = Input(shape=(latent_dim,), name='decoder_input_c')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')


# Decoder LSTM
decoder_lstm = LSTM(latent_dim, return_sequences=True,
                    return_state=True, dropout=0.2, name='decoder_lstm')
decoder_lstm_outputs, state_h, state_c = decoder_lstm(decoder_input_x,
                                                      initial_state=[decoder_input_h, decoder_input_c])

# Attention layer
attention = Attention()
attention_output = attention([decoder_lstm_outputs, decoder_input_h])


# set the dense layer
decoder_dense_with_attention = Dense(num_decoder_tokens, activation='softmax', name='decoder_dense_with_attention')
decoder_outputs_with_attention = decoder_dense_with_attention(attention_output)

# build the decoder network model
decoder_model = Model(inputs=[decoder_input_x, decoder_input_h, decoder_input_c],
                      outputs=[decoder_outputs_with_attention, state_h, state_c],
                      name='decoder')

In [176]:
decoder_model.summary()

Model: "decoder"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_h (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_c (InputLaye  [(None, 256)]                0         []                            
 r)                                                                                         

#### d3) Connect the encoder and decoder

In [177]:
# input layers
encoder_input_x = Input(shape=(None, num_encoder_tokens), name='encoder_input_x')
decoder_input_x = Input(shape=(None, num_decoder_tokens), name='decoder_input_x')

# connect encoder to decoder
encoder_final_states = encoder_model([encoder_input_x])
decoder_lstm_output, _, _ = decoder_lstm(decoder_input_x, initial_state=encoder_final_states)
decoder_pred = decoder_dense_with_attention(decoder_lstm_output)

model = Model(inputs=[encoder_input_x, decoder_input_x],
              outputs=decoder_pred,
              name='model_training')

In [178]:
print(state_h)
print(decoder_input_h)

KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.float32, name=None), name='decoder_lstm/PartitionedCall:2', description="created by layer 'decoder_lstm'")
KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.float32, name='decoder_input_h'), name='decoder_input_h', description="created by layer 'decoder_input_h'")


In [179]:
model.summary()

Model: "model_training"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 encoder_input_x (InputLaye  [(None, None, 28)]           0         []                            
 r)                                                                                               
                                                                                                  
 decoder_input_x (InputLaye  [(None, None, 47)]           0         []                            
 r)                                                                                               
                                                                                                  
 encoder (Functional)        [(None, 256),                291840    ['encoder_input_x[0][0]']     
                              (None, 256)]                                           

#### d4) Fit the model on the bilingual dataset

In [180]:
print('shape of encoder_input_data' + str(encoder_input_data.shape))
print('shape of decoder_input_data' + str(decoder_input_data.shape))
print('shape of decoder_target_data' + str(decoder_target_data.shape))

shape of encoder_input_data(28000, 20, 28)
shape of decoder_input_data(28000, 44, 47)
shape of decoder_target_data(28000, 44, 47)


In [181]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

model.fit([encoder_input_data, decoder_input_data],  # training data
          decoder_target_data,                       # labels (left shift of the target sequences)
          batch_size=32, epochs=200, validation_split=0.2)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

<keras.src.callbacks.History at 0x786cd777eb30>

In [None]:
model.save('seq2seq_attention.keras')

#### d5) Make predictions

In [182]:
# Reverse-lookup token index to decode sequences back to something readable.
reverse_input_char_index = dict((i, char) for char, i in input_token_index.items())
reverse_target_char_index = dict((i, char) for char, i in target_token_index.items())

In [183]:
def decode_sequence(input_seq):
    states_value = encoder_model.predict(input_seq)
    target_seq = numpy.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, target_token_index['\t']] = 1.

    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value, verbose=3)
        # greedy selection
        # sampled_token_index = numpy.argmax(output_tokens[0, -1, :])

        # multinomial sampling with temperature
        temperature=0.2
        pred = output_tokens[0, -1, :] ** (1.0 / temperature)
        temp_pred = pred / np.sum(pred)
        sampled_token_index = np.random.choice(len(output_tokens[0, -1, :]), p=temp_pred)

        sampled_char = reverse_target_char_index[sampled_token_index+1]
        decoded_sentence += sampled_char

        if (sampled_char == '\n' or
           len(decoded_sentence) > max_decoder_seq_length):
            stop_condition = True

        target_seq = numpy.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.

        states_value = [h, c]
        decoded_sentence = decoded_sentence[::-1]

    return decoded_sentence


In [184]:
for seq_index in range(5100, 5102):
    # Take one sequence (part of the training set)
    # for trying out decoding.
    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    print('-')
    print('English:       ', test_input_texts[seq_index])
    print('Your target language (true): ', target_texts[seq_index][1:-1])
    print('Your target language (pred): ', decoded_sentence[0:-1])


-
English:        enjoy the game
Your target language (true):  אתה יכול לטלפן לו
Your target language (pred):  א	חניתי	דתלוע	ח	
-
English:        ill do anything
Your target language (true):  אל תחששי לי
Your target language (pred):  עימפי	דתלוהק


#### d6) Translate an English sentence to the target language

In [185]:
input_sentence = 'I love you'

input_tokens = [char for char in input_sentence]

input_sequence = text2sequences_heb(max_encoder_seq_length, [input_tokens], lang="Eng")[0]

input_x = onehot_encode(input_sequence, max_encoder_seq_length, num_encoder_tokens)

translated_sentence = decode_sequence(input_x)
translated_sentence = translated_sentence.strip()

print('source sentence is: ' + input_sentence)
print('translated sentence is: ' + translated_sentence)


source sentence is: I love you
translated sentence is: אןותיתיתומולוע


#### d7) Evaluate the translation using BLEU score (5 samples)

In [186]:
from nltk.translate.bleu_score import sentence_bleu

individual_bleu_scores = []

for seq_index in range(4000, 4005):

    input_seq = test_encoder_input_data[seq_index: seq_index + 1]
    decoded_sentence = decode_sequence(input_seq)
    decoded_sentence = decoded_sentence.strip()
    actual_sentence = test_target_texts[seq_index][1:-1]
    print("Decoded sentence: ", decoded_sentence)
    print("Actual sentence: ", actual_sentence)
    candidate = decoded_sentence.split()
    references = actual_sentence.split()
    # Calculate BLEU score for this sentence
    bleu_score = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
    print('Cumulative 1-gram: %f' % sentence_bleu(references, candidate, weights=(1, 0, 0, 0)))
    print('Cumulative 2-gram: %f' % sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0)))
    print('Cumulative 3-gram: %f' % sentence_bleu(references, candidate, weights=(0.33, 0.33, 0.33, 0)))
    print('Cumulative 4-gram: %f' % sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
    individual_bleu_scores.append(bleu_score)

# Calculate the average BLEU score
average_bleu_score = sum(individual_bleu_scores) / len(individual_bleu_scores)

print("Average BLEU Score on the Test Set:", average_bleu_score)


Decoded sentence:  אלהםיורתתיויורוא
Actual sentence:  טום היה חמקמק
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  א	י	רתל	טכג
Actual sentence:  תום לא רגיש
Cumulative 1-gram: 0.500000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  עימפי	דתלוהקא
Actual sentence:  תום קרץ בחזרה
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  אתי	רת	ויתלא
Actual sentence:  אני לא סובל זוחלים
Cumulative 1-gram: 0.000000
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Decoded sentence:  א	הזיורתתויתהליש
Actual sentence:  איפה הספר
Cumulative 1-gram: 0.183940
Cumulative 2-gram: 0.000000
Cumulative 3-gram: 0.000000
Cumulative 4-gram: 0.000000
Average BLEU Score on the Test Set: 0.13678794411714423
