# Machine Translation

Machine translation, as the name suggests, is the automatic conversion of text from one language to another without human intervention. In other words, a machine is taught how to translate text between languages. It is one of the most heavily researched areas of Artificial Intelligence.

# Environment

- anaconda
- venv, etc.

# Dependencies

In [3]:
import keras
import pydotplus
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.callbacks import ModelCheckpoint
import string
import re
from pickle import dump,load
from unicodedata import normalize
from numpy import array
from numpy.random import rand
from numpy.random import shuffle


Using TensorFlow backend.


# Loading Data

The data set I have used here consists of sentences in English paired with their translations in German.

In [2]:
loc = '/home/codersarts/Desktop/data/deu.txt'
destination = '/home/codersarts/Desktop/data/dg.pkl'
destination_full  = '/home/codersarts/Desktop/data/dg_full.pkl'
destination_train = '/home/codersarts/Desktop/data/dg_train.pkl'
destination_test  = '/home/codersarts/Desktop/data/dg_test.pkl'
# Read the whole file; the context manager closes it even if reading fails
with open(loc, mode='rt', encoding='utf-8') as file:
    text = file.read()

# Data Preprocessing

The raw data is loosely structured and noisy, so it has to be cleaned before further processing. In this step we remove the noise; the result is a list of sentence pairs, each holding an English sentence and its German translation.

In [13]:
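# Split the file into lines, then each line into its tab-separated fields
lines = text.strip().split('\n')
pairs = [line.split('\t') for line in lines]
cleaned = list()
# Regex matching any non-printable character
re_print = re.compile('[^%s]' % re.escape(string.printable))
# Translation table that deletes punctuation
table = str.maketrans('', '', string.punctuation)
for pair in pairs:
    clean_pair = list()
    for line in pair:
        # Decompose accented characters and drop anything outside ASCII
        line = normalize('NFD', line).encode('ascii', 'ignore')
        line = line.decode('UTF-8')
        # Tokenize on whitespace
        line = line.split()
        # Lower-case, then strip punctuation and non-printable characters
        line = [word.lower() for word in line]
        line = [word.translate(table) for word in line]
        line = [re_print.sub('', w) for w in line]
        # Keep purely alphabetic tokens only
        line = [word for word in line if word.isalpha()]
        clean_pair.append(' '.join(line))
    cleaned.append(clean_pair)
cleaned = array(cleaned)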

In [16]:
for i in range(10):
    print('%s --> %s' % (cleaned[i,0], cleaned[i,1]))

hi --> hallo
hi --> gru gott
run --> lauf
wow --> potzdonner
wow --> donnerwetter
fire --> feuer
help --> hilfe
help --> zu hulf
stop --> stopp
wait --> warte
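
The cleaning steps can also be tried in isolation on a single sentence. The sketch below wraps the same operations in a helper called `clean_line`; that name is introduced here for illustration and is not part of the notebook. It needs only the standard library.

```python
import re
import string
from unicodedata import normalize

# Same helpers as in the preprocessing cell.
re_print = re.compile('[^%s]' % re.escape(string.printable))
table = str.maketrans('', '', string.punctuation)

def clean_line(line):
    """Hypothetical helper: apply the notebook's cleaning steps to one sentence."""
    # Decompose accented characters, then drop everything that is not ASCII.
    line = normalize('NFD', line).encode('ascii', 'ignore').decode('UTF-8')
    words = line.split()
    words = [w.lower() for w in words]
    words = [w.translate(table) for w in words]
    words = [re_print.sub('', w) for w in words]
    # Tokens containing digits or other non-letters are discarded.
    words = [w for w in words if w.isalpha()]
    return ' '.join(words)

print(clean_line('Grüß Gott!'))   # gru gott
print(clean_line('Zu Hülf!'))     # zu hulf
```

This makes one side effect visible: "ü" decomposes to "u" plus a combining mark and survives as "u", whereas "ß" has no ASCII decomposition and is dropped entirely, which is why "Grüß Gott" appears as "gru gott" in the output above.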


In [18]:
# Save the cleaned pairs; the next section loads them from this file
dump(cleaned, open(destination, 'wb'))

# Train Test Split

Once we have clean data, we need to divide it into training and test sets so that the model can be trained on one and then evaluated on the other. From the first 10,000 sentence pairs I have taken 9,000 as training samples and 1,000 as test samples, and saved both sets to disk as pickle files.

In [23]:
raw_data = load(open(destination,'rb'))
max_count = 10000

dataset = raw_data[:max_count, :]
shuffle(dataset)

train, test = dataset[:9000], dataset[9000:]

dump(dataset, open(destination_full , 'wb'))
dump(train, open(destination_train, 'wb'))
dump(test, open(destination_test , 'wb'))
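
`numpy.random.shuffle` reorders the rows in place and produces a different order on every run unless the random generator is seeded, so the saved train and test files change from run to run. A minimal sketch of a seeded 90/10 split on toy pairs, using only the standard library; the helper name `split_pairs`, the seed value and the toy data are assumptions made for this example, not something the notebook defines.

```python
import random

def split_pairs(pairs, train_fraction=0.9, seed=42):
    """Hypothetical helper: shuffle a copy of the pairs and split it."""
    shuffled = list(pairs)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the order is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

toy_pairs = [('hi', 'hallo'), ('run', 'lauf'), ('fire', 'feuer'),
             ('help', 'hilfe'), ('stop', 'stopp'), ('wait', 'warte'),
             ('wow', 'donnerwetter'), ('go', 'geh'), ('no', 'nein'),
             ('yes', 'ja')]
train_pairs, test_pairs = split_pairs(toy_pairs)
print(len(train_pairs), len(test_pairs))  # 9 1
```

Calling `split_pairs` again with the same seed returns the same split, which makes later comparisons between training runs easier to interpret.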


In [8]:
dataset = load(open(destination_full, 'rb'))
train = load(open(destination_train, 'rb'))
test = load(open(destination_test, 'rb'))
def max_length(lines):
    return max(len(line.split()) for line in lines)

# Tokenization & Encoding

Tokenization is the process of splitting text into tokens, either word tokens or sentence tokens; here we work with word tokens. Separate tokenizers are fitted for the German and the English data, using the Tokenizer class from keras.preprocessing.text. Once the data is tokenized, each sentence is turned into a sequence of integers and the sequences are padded to a fixed length with pad_sequences. We have to do this for both the training and the test data. Finally, we need to encode our targets as categorical (one-hot) data with to_categorical.

In [10]:
# English tokenizer
tokenizerEN = Tokenizer()
tokenizerEN.fit_on_texts(dataset[:,0])
eng_vocab_size = len(tokenizerEN.word_index)+1
eng_length = max_length(dataset[:, 0])

print('English Vocabulary Size: %d' % eng_vocab_size)
print('English Max Length: %d' % (eng_length))

English Vocabulary Size: 2200
English Max Length: 5


In [12]:
# German tokenizer
tokenizerGE = Tokenizer()
tokenizerGE.fit_on_texts(dataset[:,1])
ger_vocab_size = len(tokenizerGE.word_index)+1
ger_length = max_length(dataset[:, 1])

print('German Vocabulary Size: %d' % ger_vocab_size)
print('German Max Length: %d' % (ger_length)) 

German Vocabulary Size: 3529
German Max Length: 9
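
To make the two steps concrete, the following pure-Python sketch mimics what `Tokenizer` and `pad_sequences` do on three toy sentences. It is an approximation written for illustration: the helper names `build_word_index`, `texts_to_sequences` and `pad_post` are not Keras functions, and the sketch assumes the text is already lower-cased and free of punctuation, as it is after the cleaning step.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad_post(sequences, maxlen):
    """Right-pad with zeros up to maxlen; longer sequences keep their last maxlen items."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded

texts = ['ich bin hier', 'ich bin', 'hilfe']
word_index = build_word_index(texts)
maxlen = max(len(t.split()) for t in texts)
padded = pad_post(texts_to_sequences(texts, word_index), maxlen)

print(word_index)  # {'ich': 1, 'bin': 2, 'hier': 3, 'hilfe': 4}
print(padded)      # [[1, 2, 3], [1, 2, 0], [4, 0, 0]]
```

Because index 0 is reserved for padding, the vocabulary size in the cells above is computed as `len(word_index) + 1`.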


In [15]:
# encode/pad sequences for train
x = tokenizerGE.texts_to_sequences(train[:,1])
TrainX = pad_sequences(x , maxlen = ger_length , padding = 'post')

y = tokenizerEN.texts_to_sequences(train[:,0])
trainY = pad_sequences(y , maxlen = eng_length , padding = 'post')

yl = list()
for seq in trainY:
    encod = to_categorical(seq , num_classes = eng_vocab_size)
    yl.append(encod)
var = array(yl)
TrainY = var.reshape(trainY.shape[0], trainY.shape[1], eng_vocab_size)

In [17]:
# encode/pad sequences for test
x = tokenizerGE.texts_to_sequences(test[:,1])
TestX = pad_sequences(x , maxlen = ger_length , padding = 'post')

y = tokenizerEN.texts_to_sequences(test[:,0])
testY = pad_sequences(y , maxlen = eng_length , padding = 'post')

yl = list()
for seq in testY:
    encod = to_categorical(seq , num_classes = eng_vocab_size)
    yl.append(encod)
var = array(yl)
TestY = var.reshape(testY.shape[0], testY.shape[1], eng_vocab_size)
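
`to_categorical` turns every integer in a padded target sequence into a one-hot vector of length `eng_vocab_size`, which is the form the softmax output layer is trained against. A small pure-Python sketch of the same idea follows; `one_hot_sequence` is a name made up for this example.

```python
def one_hot_sequence(seq, num_classes):
    """Return one row per time step, with a single 1 at the word's index."""
    rows = []
    for index in seq:
        row = [0] * num_classes
        row[index] = 1
        rows.append(row)
    return rows

# A padded target of length 3 over a toy vocabulary of 5 entries (0 = padding).
encoded = one_hot_sequence([2, 4, 0], num_classes=5)
for row in encoded:
    print(row)
# [0, 0, 1, 0, 0]
# [0, 0, 0, 0, 1]
# [1, 0, 0, 0, 0]
```

The encoded targets grow quickly: with the sizes printed above, the training targets hold 9000 × 5 × 2200 = 99,000,000 entries. That is one reason some implementations keep integer targets and use the `sparse_categorical_crossentropy` loss instead.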

# Model Design

In [40]:
model = Sequential()
model.add(Embedding(ger_vocab_size, 256, input_length = ger_length, mask_zero=True))
model.add(LSTM(256))
model.add(RepeatVector(eng_length))
model.add(LSTM(256, return_sequences=True))
model.add(TimeDistributed(Dense(eng_vocab_size, activation='softmax')))
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(TrainX, TrainY, epochs=29, batch_size=64, validation_data=(TestX, TestY))

Train on 9000 samples, validate on 1000 samples
Epoch 1/29
Epoch 2/29
Epoch 3/29
Epoch 4/29
Epoch 5/29
Epoch 6/29
Epoch 7/29
Epoch 8/29
Epoch 9/29
Epoch 10/29
Epoch 11/29
Epoch 12/29
Epoch 13/29
Epoch 14/29
Epoch 15/29
Epoch 16/29
Epoch 17/29
Epoch 18/29
Epoch 19/29
Epoch 20/29
Epoch 21/29
Epoch 22/29
Epoch 23/29
Epoch 24/29
Epoch 25/29
Epoch 26/29
Epoch 27/29
Epoch 28/29
Epoch 29/29


<keras.callbacks.History at 0x7fbf124bcdd0>
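
Once trained, the model returns one probability distribution over the English vocabulary per output time step for each input sentence. A common way to read a translation off that output is to take the most likely index at each step and look the word up in the vocabulary. The sketch below shows the idea in plain Python on made-up probabilities; `decode_greedy` and the toy `index_word` mapping are illustrative assumptions and do not appear in the notebook.

```python
def decode_greedy(step_probs, index_word):
    """Pick the most probable index at each step and map it back to a word."""
    words = []
    for probs in step_probs:
        best = max(range(len(probs)), key=probs.__getitem__)
        if best == 0:          # index 0 is padding, so it carries no word
            continue
        words.append(index_word[best])
    return ' '.join(words)

index_word = {1: 'i', 2: 'am', 3: 'here'}   # toy inverse vocabulary
step_probs = [
    [0.05, 0.80, 0.10, 0.05],   # step 1 -> index 1
    [0.10, 0.10, 0.70, 0.10],   # step 2 -> index 2
    [0.10, 0.05, 0.05, 0.80],   # step 3 -> index 3
    [0.90, 0.04, 0.03, 0.03],   # step 4 -> padding
]
print(decode_greedy(step_probs, index_word))  # i am here
```

With the real model, the probabilities would come from `model.predict(TestX)` and the inverse mapping could be built by swapping keys and values of `tokenizerEN.word_index`.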

# Model Architecture

The architecture for this model is shown below. The first layer is an embedding layer, which maps each word index to a dense vector of latent features. Then comes the encoder, a Long Short-Term Memory (LSTM) layer that summarizes the input sentence as a single vector. On top of the LSTM is a RepeatVector layer, which repeats that vector once for every output time step. Then comes the decoder, a second LSTM that returns a full sequence, and finally a TimeDistributed layer that wraps the Dense softmax layer so it is applied at every time step.
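
The role of the repeat layer is easiest to see with shapes. The encoder LSTM compresses the whole input sentence into a single vector, while the decoder LSTM expects one input per output time step, so the vector is copied once per step. A toy sketch in plain Python; `repeat_vector` here is a stand-in written for illustration, not the Keras layer itself.

```python
def repeat_vector(vector, n):
    """Return n copies of the vector, one per output time step."""
    return [list(vector) for _ in range(n)]

encoded_sentence = [0.2, -0.1, 0.7]      # stand-in for the 256-dim encoder output
decoder_input = repeat_vector(encoded_sentence, n=5)

print(len(decoder_input), len(decoder_input[0]))  # 5 3
```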

In [39]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 9, 256)            903424    
_________________________________________________________________
lstm_7 (LSTM)                (None, 256)               525312    
_________________________________________________________________
repeat_vector_4 (RepeatVecto (None, 9, 256)            0         
_________________________________________________________________
lstm_8 (LSTM)                (None, 9, 256)            525312    
_________________________________________________________________
time_distributed_4 (TimeDist (None, 9, 3529)           906953    
Total params: 2,861,001
Trainable params: 2,861,001
Non-trainable params: 0
_________________________________________________________________
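
The parameter counts in the summary can be checked by hand with the usual formulas for each layer type: an embedding has one row of weights per vocabulary entry, an LSTM has four gates that each see the input and the previous hidden state plus a bias, and a dense layer has one weight per input-output pair plus one bias per output. The function names below are introduced for this check only.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    return (input_dim + 1) * units

counts = [
    embedding_params(3529, 256),   # embedding_4
    lstm_params(256, 256),         # lstm_7
    0,                             # repeat_vector_4 has no weights
    lstm_params(256, 256),         # lstm_8
    dense_params(256, 3529),       # time_distributed_4
]
print(counts)       # [903424, 525312, 0, 525312, 906953]
print(sum(counts))  # 2861001
```

These figures correspond to a model with 9 output time steps and 3,529 output classes. With the sizes printed in the tokenization cells (English max length 5, English vocabulary 2,200), the model defined under Model Design would instead be expected to report `(None, 5, 256)` for the repeat layer and `(None, 5, 2200)` for the output layer, so the summary above appears to have been captured from a different run.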


In [38]:
# Make Keras use the pydotplus module imported in the Dependencies cell
keras.utils.vis_utils.pydot = pydotplus
plot_model(model, to_file='/home/codersarts/Desktop/data/NTMmodel.png')