In [1]:
import numpy as np
import pandas as pd
import os
import string
import re
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model
from string import digits

# English to Bulgarian Seq2Seq Neural Machine Translation

### A Deep Learning Project by Nikolay Nikolov

## Introduction

The following project implements Neural Machine Translation (NMT) from English to Bulgarian. NMT works by predicting the likelihood of a sequence of words and typically models entire sentences at once. Because these likelihoods are learned from examples, larger datasets of translated sentences generally result in better models.

NMT works by using vector representations for words and internal states. It features a sequence model that predicts one word at a time. Its prediction is conditioned on the entire source sentence and on what has already been produced in the target sentence.
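In symbols, this conditioning can be written as a factorization of the translation probability. The notation is introduced here only for illustration and is not used elsewhere in the notebook: $x$ denotes the source sentence and $y_1, \dots, y_T$ the words of the target sentence.

```latex
p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x)
```

Each factor on the right-hand side corresponds to one decoder step: a probability distribution over the target vocabulary, given the source sentence and the words produced so far.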

This model is based on a system of recurrent neural networks (RNNs). It contains a first RNN, known as the encoder, that encodes the source sentence, and a second RNN, known as the decoder, that predicts the words in the target language. Encoders are often bidirectional, but the one implemented below is a single unidirectional layer. Both networks use a special architecture called LSTM (Long Short-Term Memory), whose repeating module contains several interacting layers in order to address the long-term dependency problem, in which a plain RNN struggles to learn or make connections over a large information gap. The encoder still has to compress the whole input into a single fixed-size state, which makes long inputs harder to translate.
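For reference, one common textbook formulation of a single LSTM step is sketched below. The symbols are assumptions made for this illustration ($x_t$ is the input at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication), and the Keras layer used later may differ in implementation details.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of $c_t$ is what lets information persist across many time steps. The pair $(h_t, c_t)$ after the last input word is the state that the encoder hands to the decoder.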

The dataset for this project consists of several tens of thousands of English sentences taken from film subtitles, together with their Bulgarian translations. The data was provided by OpenSubtitles. This is not ideal data for a machine translation project: the vast majority of subtitles are not translated literally, but according to the general meaning of a phrase. However, I chose this project quite late and did not have the time to build a quality dataset of literal translations between the two languages.

For the same reasons (the relative lack of literal translations and the relatively small size of the dataset, about 40k phrases), I will test the model on phrases that it has already seen rather than on free text.

Let's start by importing our data and looking at a random sample:

In [2]:
corpus = pd.read_csv('/content/drive/MyDrive/enbg.txt', encoding = 'utf-8', sep = '\t', header = None)
corpus.columns = ['english', 'bulgarian']
corpus.sample(10)

Unnamed: 0,english,bulgarian
6615,Do your best.,Успех.
22480,My little starling.,Звездичке моя.
11986,I am doing this for our family.,Правя това за семейството ни.
14404,I see you're going in for jewelery.,"Гледам, че обичаш бижутата."
561,"Again, I am very sorry, is not your father Fel...","Пак силно се извинявам, баща ви е не е ли оня ..."
14533,"I suppose, in a sense, he's the first modern m...","По някакъв начин, той е първият модерен човек."
7663,Everything you need?,"Всичко, което ви трябва?"
26685,She was in the garden with mr.,"Беше в градината с г-н Доукър, сър."
33528,Was it changed?,Че кога са я подменили?
20533,let me show you who you're dealing with,Нека ви покажа с кого си имате работа.


Let's now investigate if there are any null values that should be taken care of:

In [3]:
pd.isnull(corpus).sum()

english      0
bulgarian    0
dtype: int64

Great! Now we will apply several operations to our corpus to make it suitable for a neural network: remove duplicate rows (if any exist), lowercase all characters, and remove apostrophes, other punctuation, digits, and excess spaces.

In [4]:
corpus.drop_duplicates(inplace=True)

In [5]:
# Lowercase all capital letters
corpus['english']=corpus['english'].apply(lambda x: x.lower())
corpus['bulgarian']=corpus['bulgarian'].apply(lambda x: x.lower())

In [6]:
# Remove apostrophes
corpus['english']=corpus['english'].apply(lambda x: re.sub("'", '', x))
corpus['bulgarian']=corpus['bulgarian'].apply(lambda x: re.sub("'", '', x))

In [7]:
# Remove punctuation
exclude = set(string.punctuation)
corpus['english']=corpus['english'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
corpus['bulgarian']=corpus['bulgarian'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [8]:
# Remove digits
remove_digits = str.maketrans('', '', digits)
corpus['english']=corpus['english'].apply(lambda x: x.translate(remove_digits))
corpus['bulgarian']=corpus['bulgarian'].apply(lambda x: x.translate(remove_digits))

In [9]:
# Remove leading and trailing whitespace, then collapse repeated spaces
corpus['english']=corpus['english'].apply(lambda x: x.strip())
corpus['bulgarian']=corpus['bulgarian'].apply(lambda x: x.strip())
corpus['english']=corpus['english'].apply(lambda x: re.sub(" +", " ", x))
corpus['bulgarian']=corpus['bulgarian'].apply(lambda x: re.sub(" +", " ", x))
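The cleaning steps above can also be sketched as a single function that is easy to try on one string. This is a minimal sketch, not the notebook's own code: `clean_text` is a hypothetical helper name. It applies the same operations in the same order. Note that `string.punctuation` only contains ASCII punctuation, so typographic characters such as „ and “ pass through unchanged.

```python
import re
import string
from string import digits

_PUNCT = set(string.punctuation)          # ASCII punctuation only (includes the apostrophe)
_DIGITS = str.maketrans('', '', digits)   # translation table that deletes 0-9


def clean_text(text):
    """Lowercase, drop ASCII punctuation and digits, normalise spaces."""
    text = text.lower()
    text = ''.join(ch for ch in text if ch not in _PUNCT)
    text = text.translate(_DIGITS)
    return re.sub(' +', ' ', text.strip())


print(clean_text("I see you're going in for jewelery."))
print(clean_text("Гледам, че обичаш бижутата."))
```

With pandas, such a helper could be applied as `corpus['english'].apply(clean_text)`, which would replace the separate per-step cells.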

In [10]:
corpus.sample(10)

Unnamed: 0,english,bulgarian
40505,your going will not be necessary,няма да е нужно да заминаваш
39018,you flew over us several times you know very w...,прелетяхте над нас няколко пъти знаете много д...
1068,along the coastal regions an area about the si...,по крайбрежието район с размерите на великобри...
38509,you are kidding,какво шегуваш се
37635,without my robe i cannot return,не мога да се върна вкъщи без ангелското си об...
30918,there aint enough here for us,тук няма достатъчно дори за нас
22598,my wife made made this shrine,жена ми тя ли направи това
36987,whos wasting their time,кой на кого губи времето
5585,concentrate eunchae,ела на себе си ън че
13414,i have to go somewhere ill see you later,трябва ми чист въздух


Now let's add a start token and an end token at the beginning and end of every target sentence. We need these because of the encoder-decoder structure: the decoder needs a known first token to start generating from, and a token that signals when to stop generating.

In [11]:
corpus['bulgarian'] = corpus['bulgarian'].apply(lambda x : 'START_ '+ x + ' _END')
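To see why both markers are needed, here is a small sketch of how a marked target sentence is split into decoder inputs and decoder targets during training. `teacher_forcing_pairs` is a hypothetical helper written for this illustration. The shift by one position mirrors what the batch generator later in the notebook does with indices.

```python
def teacher_forcing_pairs(target_sentence):
    """Pair each decoder input token with the token it should predict."""
    tokens = target_sentence.split()
    decoder_input = tokens[:-1]    # START_ ... last real word
    decoder_target = tokens[1:]    # first real word ... _END
    return list(zip(decoder_input, decoder_target))


for seen, expected in teacher_forcing_pairs('START_ виж какво стана _END'):
    print(f'{seen:>8} -> {expected}')
```

`START_` only ever appears as an input and `_END` only ever appears as a target, which is why the first is used to begin decoding and the second to stop it.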

The following piece of code collects the unique words of each language into a set:

In [12]:
all_english_words = set()
for english in corpus['english']:
    for word in english.split():
        if word not in all_english_words:
            all_english_words.add(word)

all_bulgarian_words = set()
for bulgarian in corpus['bulgarian']:
    for word in bulgarian.split():
        if word not in all_bulgarian_words:
            all_bulgarian_words.add(word)

In [13]:
print(len(all_english_words))
print(len(all_bulgarian_words))

16654
28123


It is very interesting that the Bulgarian translations contain roughly 1.7 times as many unique words as the English phrases. This is likely because Bulgarian has many more verb conjugation forms than English, and because its definite articles are attached to nouns, unlike in English.

For filtering purposes, let's add two more columns that indicate the length in words of each English and Bulgarian sentence. The Bulgarian count includes the start and end tokens of each sentence.

In [14]:
corpus['english_sentence_length'] = corpus['english'].apply(lambda x:len(x.split(" ")))
corpus['bulgarian_sentence_length'] = corpus['bulgarian'].apply(lambda x:len(x.split(" ")))
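One subtlety is worth noting: `split(" ")` and `split()` behave differently on empty strings, and the vocabulary cell above uses `split()`. The sketch below compares the two. Whether the cleaned corpus actually contains empty sentences is not checked here, so treat this as a caveat rather than a confirmed problem.

```python
samples = ['a baby', '']

for text in samples:
    by_space = len(text.split(' '))   # splits on every single space
    by_any = len(text.split())        # splits on runs of whitespace, drops empties
    print(repr(text), by_space, by_any)
```

A sentence that consisted only of punctuation or digits would be empty after cleaning, and `split(" ")` would still count it as one word.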

In [15]:
corpus

Unnamed: 0,english,bulgarian,english_sentence_length,bulgarian_sentence_length
0,a yearold girl needs her mother matt,START_ едно годишно момиче се нуждае от майка ...,7,11
1,a baby,START_ това е бебе _END,2,5
2,a balloon like that,START_ точно такова балонче _END,4,5
3,a bandit,START_ бандит _END,2,3
4,a bankrupt coal mine,START_ фалирала _END,4,3
...,...,...,...,...
41176,zoya emilievna you were amazing,START_ зоя прекрасно _END,5,4
41177,zoya its mayakovsky,START_ маяковски зоя _END,3,4
41178,zoya,START_ зоя _END,1,3
41179,zoya yes,START_ зоя _END,2,3


In theory the dataset can contain sentences of any length, but for the sake of optimization, let's set a threshold on the sentence length we'll use for our model. We keep only the sentences that are at most 20 words long.

In [16]:
corpus = corpus[corpus['english_sentence_length']<=20]
corpus = corpus[corpus['bulgarian_sentence_length']<=20]

Here is the final form of the corpus that we'll be using:

In [17]:
corpus

Unnamed: 0,english,bulgarian,english_sentence_length,bulgarian_sentence_length
0,a yearold girl needs her mother matt,START_ едно годишно момиче се нуждае от майка ...,7,11
1,a baby,START_ това е бебе _END,2,5
2,a balloon like that,START_ точно такова балонче _END,4,5
3,a bandit,START_ бандит _END,2,3
4,a bankrupt coal mine,START_ фалирала _END,4,3
...,...,...,...,...
41176,zoya emilievna you were amazing,START_ зоя прекрасно _END,5,4
41177,zoya its mayakovsky,START_ маяковски зоя _END,3,4
41178,zoya,START_ зоя _END,1,3
41179,zoya yes,START_ зоя _END,2,3


We need to store the maximum sentence length in each language for array generation later on.

In [18]:
# English is the source language and Bulgarian is the target language
max_length_src = max(corpus['english_sentence_length'])
max_length_tar = max(corpus['bulgarian_sentence_length'])

In the next few variables we will store the input words, the target words, and the number of encoder and decoder tokens.

In [19]:
input_words = sorted(list(all_english_words))
target_words = sorted(list(all_bulgarian_words)) 
num_encoder_tokens = len(all_english_words)
num_decoder_tokens = len(all_bulgarian_words)
num_encoder_tokens, num_decoder_tokens

(16654, 28123)

We need to increase the number of encoder and decoder tokens by one, because index 0 is reserved for zero-padding:

In [20]:
num_encoder_tokens += 1
num_decoder_tokens += 1

Now let's make a dictionary featuring the index of tokens in both languages:

In [21]:
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

We also create a reverse-lookup token index to decode predicted indices into something readable at the output:

In [22]:
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())
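The round trip through these dictionaries can be illustrated on a toy vocabulary. This is a sketch with made-up names (`toy_words`, `encode`, `decode`) that are not part of the notebook. It shows why indices start at 1: index 0 stays free to mean "padding".

```python
toy_words = sorted({'a', 'baby', 'bandit'})
token_index = {word: i + 1 for i, word in enumerate(toy_words)}   # 0 is reserved
reverse_index = {i: word for word, i in token_index.items()}


def encode(sentence, max_len):
    """Map words to indices and right-pad with zeros up to max_len."""
    ids = [token_index[word] for word in sentence.split()]
    return ids + [0] * (max_len - len(ids))


def decode(ids):
    """Map indices back to words, skipping the padding index 0."""
    return ' '.join(reverse_index[i] for i in ids if i != 0)


encoded = encode('a bandit', max_len=4)
print(encoded)
print(decode(encoded))
print('embedding rows needed:', len(toy_words) + 1)
```

The last line is the toy counterpart of incrementing `num_encoder_tokens` and `num_decoder_tokens` by one.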

In [23]:
corpus = shuffle(corpus)
corpus

Unnamed: 0,english,bulgarian,english_sentence_length,bulgarian_sentence_length
31373,these knickknacks tsars shoulder cord,START_ тези дрънкулки царски акселбанти _END,5,6
19439,its very interesting to hear you talk about so...,START_ много интересно ти да говориш че някой ...,12,13
33713,we care about others frank,START_ не ни пука за другите франк _END,5,8
14483,i spoke to her,START_ говорих с нея _END,4,5
26496,she didnt shoot him,START_ не го е застреляла тя _END,4,7
...,...,...,...,...
375,about this wedding what am i supposed to report,START_ какво да докладвам за тази сватба _END,9,8
32423,this style wouldnt work on anyone but yoon right,START_ само юне може да носи такива неща права...,9,14
23152,no no not at all,START_ не разбира се нищо _END,5,6
9792,he kept on seeing that see lsnt that funny,START_ а той продължи да я вижда във въображен...,9,16


Let's now perform a train-test split in an 80:20 ratio:

In [24]:
X, y = corpus['english'], corpus['bulgarian']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape

((31723,), (7931,))

The next piece of code is a generator that yields batches of padded encoder inputs, padded decoder inputs, and one-hot encoded decoder targets:

In [25]:
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input sequence
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input sequence
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)
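A rough calculation shows why the data is produced batch by batch instead of being materialised all at once. The sketch below assumes a target length of 20 (the cap applied earlier), since the actual value of `max_length_tar` is not printed in the notebook. The other numbers are taken from the outputs above.

```python
batch_size = 128
max_length_tar = 20            # assumption: equals the 20-word cap
num_decoder_tokens = 28124     # 28123 words + 1 padding index
train_samples = 31723
bytes_per_float32 = 4

# Size of the one-hot target array for a single batch and for the whole training set
per_batch = batch_size * max_length_tar * num_decoder_tokens * bytes_per_float32
whole_set = train_samples * max_length_tar * num_decoder_tokens * bytes_per_float32

print(f'one batch : {per_batch / 1e6:.0f} MB')
print(f'whole set : {whole_set / 1e9:.1f} GB')
```

Under these assumptions, the one-hot targets for the full training set would need tens of gigabytes, whereas a single batch fits comfortably in memory.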

In [26]:
latent_dim=300 # Dimensionality of the word embeddings and of the LSTM hidden and cell states

We'll now set up the encoder and decoder layers and compile the model:

In [27]:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [28]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [29]:
model.compile(optimizer = 'rmsprop', loss = 'categorical_crossentropy')

In [30]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 300)    4996500     input_1[0][0]                    
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, None, 300)    8437200     input_2[0][0]                    
______________________________________________________________________________________________
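The parameter counts of the two embedding layers in the summary can be verified by hand. The sketch below also computes the LSTM and dense layer sizes with the standard formulas for layers with biases. Those two figures are calculated here and do not appear in the truncated summary above.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


latent_dim = 300
print('encoder embedding:', embedding_params(16655, latent_dim))
print('decoder embedding:', embedding_params(28124, latent_dim))
print('each LSTM        :', lstm_params(latent_dim, latent_dim))
print('dense softmax    :', dense_params(latent_dim, 28124))
```

The output layer and the decoder embedding dominate the model size, because both scale with the Bulgarian vocabulary.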

We train for 150 epochs, which is enough for this model to produce decent results.

In [31]:
train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 128
epochs = 150
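Training runs `train_samples//batch_size` steps per epoch and `val_samples//batch_size` validation steps. A quick check of what these floor divisions give for the split sizes printed earlier (the variable names for the leftovers are introduced here for illustration):

```python
train_samples, val_samples, batch_size = 31723, 7931, 128

steps_per_epoch = train_samples // batch_size
validation_steps = val_samples // batch_size

# Samples that fall into a final, partially filled batch
leftover_train = train_samples - steps_per_epoch * batch_size
leftover_val = val_samples - validation_steps * batch_size

print(steps_per_epoch, validation_steps)
print(leftover_train, leftover_val)
```

Because the generator loops forever, the leftover samples are not lost. They are yielded in a partially filled batch at the start of the following epoch, so epoch boundaries drift slightly relative to passes over the data.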

In [32]:
# `fit_generator` is deprecated in TensorFlow 2.x; `fit` accepts Python generators directly
model.fit(generate_batch(X_train, y_train, batch_size = batch_size),
          steps_per_epoch = train_samples//batch_size,
          epochs = epochs,
          validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
          validation_steps = val_samples//batch_size)



Epoch 1/150
...
Epoch 77/150
...

<tensorflow.python.keras.callbacks.History at 0x7f0cf01e32b0>

In [33]:
# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2 = dec_emb_layer(decoder_inputs) # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)

In [34]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first position of the target sequence with the START_ token.
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Exit condition: either the decoded string exceeds 50 characters
        # or the _END token is produced.
        if (sampled_char == '_END' or
           len(decoded_sentence) > 50):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence
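The loop above is a greedy decoder: at every step it keeps only the most likely word and feeds it back in. The sketch below isolates that control flow from Keras. `greedy_decode` and `toy_step` are hypothetical names, and `toy_step` is a lookup table standing in for `decoder_model.predict`. Unlike the function above, this version limits the output by word count rather than by character count, which is a deliberate variation.

```python
def greedy_decode(step_fn, start_token, end_token, max_words=20):
    """Feed the last predicted token back in until end_token or max_words."""
    words = []
    token, state = start_token, None
    while len(words) < max_words:
        token, state = step_fn(token, state)
        if token == end_token:
            break
        words.append(token)
    return ' '.join(words)


# Hypothetical stand-in for one decoder step: a fixed next-word table
_NEXT = {'START_': 'виж', 'виж': 'какво', 'какво': 'стана', 'стана': '_END'}


def toy_step(token, state):
    return _NEXT[token], state


print(greedy_decode(toy_step, 'START_', '_END'))
```

Counting words makes the stopping rule consistent with the 20-word threshold used when filtering the corpus.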

## Results

In [35]:
train_gen = generate_batch(X_train, y_train, batch_size = 1)
k=-1

### Attempt 1:

In [36]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Bulgarian Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Bulgarian Translation:', decoded_sentence[:-4])

Input English sentence: now look what happened
Actual Bulgarian Translation:  виж какво стана 
Predicted Bulgarian Translation:  виж какво стана 


The first translation attempt exactly matches the reference translation provided in the dataset.
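The printing code removes the markers with fixed character offsets (`[6:-4]` and `[:-4]`), which assumes that the decoded string always ends in `_END`. If decoding stops at the length limit instead, the slice cuts off real text. A token-based alternative is sketched below, where `strip_markers` is a hypothetical helper and not part of the notebook:

```python
def strip_markers(sentence, start='START_', end='_END'):
    """Drop the start/end markers wherever they occur and tidy the spacing."""
    return ' '.join(w for w in sentence.split() if w not in (start, end))


print(strip_markers('START_ виж какво стана _END'))
print(strip_markers(' ще те убие'))     # no _END present: nothing is lost
print(repr(' ще те убие'[:-4]))         # the fixed slice removes a real word
```

This also removes the leading and trailing spaces visible in the printed translations.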

### Attempt 2:

In [37]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Bulgarian Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Bulgarian Translation:', decoded_sentence[:-4])

Input English sentence: could you leave us for a moment
Actual Bulgarian Translation:  ще ни оставите ли сами за малко 
Predicted Bulgarian Translation:  ще ни оставите ли за малко 


The second translation attempt is actually closer to the original English sentence than the dataset translation is. The dataset translation includes the word "alone", which is not present in the English sentence, and the model has correctly left it out of its prediction.

### Attempt 3:

In [38]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Bulgarian Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Bulgarian Translation:', decoded_sentence[:-4])

Input English sentence: hello
Actual Bulgarian Translation:  здравейте 
Predicted Bulgarian Translation:  хей ти 


The third translation attempt can be considered valid: even though it does not directly translate the word "hello", its output is also a greeting, albeit a much less formal one ("hey you"). We can deduce that the model has done a decent job of at least capturing the semantics of the input.

### Attempt 4:

In [39]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Bulgarian Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Bulgarian Translation:', decoded_sentence[:-4])

Input English sentence: you just dont understand that
Actual Bulgarian Translation:  ти просто не го разбираш 
Predicted Bulgarian Translation:  ти просто разбираш ли го казваш 


In this attempt, the model has caught the meaning of the first part of the phrase, as well as of the verb, but has not grasped the negation. It has also added a supplementary verb that is not found in the original sentence.

### Attempt 5:

In [40]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Bulgarian Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Bulgarian Translation:', decoded_sentence[:-4])

Input English sentence: alice lock that door
Actual Bulgarian Translation:  алис заключи вратата 
Predicted Bulgarian Translation:  алис заключи вратата 


Again, the translation perfectly matches the one provided in the dataset.

### Attempt 6:

In [41]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Bulgarian Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Bulgarian Translation:', decoded_sentence[:-4])

Input English sentence: he will kill you on sight
Actual Bulgarian Translation:  ще те убие на мига 
Predicted Bulgarian Translation:  ще те убие 


In the final attempt, the model seems to have succeeded in translating only the first half of the phrase, ignoring the second half.

## Conclusion

Once again, this model is far from production quality. Its purpose is only to demonstrate how easy it is to implement an RNN for such a task. If this work is expanded with a much larger dataset and more accurate source-translation pairs, a far better model can be built, one that may even be able to handle free text.

## References

Lstm_seq2seq. (n.d.). Retrieved February 18, 2021, from https://keras.rstudio.com/articles/examples/lstm_seq2seq.html

Kalchbrenner, Nal; Blunsom, Philip (2013). "Recurrent Continuous Translation Models". _Proceedings of the Association for Computational Linguistics: 1700–1709_.

Sepp Hochreiter; Jürgen Schmidhuber (1997). "Long short-term memory". _Neural Computation._ 9 (8): 1735–1780.