<a href="https://colab.research.google.com/github/kvamsi7/Neural-Machine-Translation/blob/main/scratchpad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [106]:
import string
import re
from numpy import array,argmax,random,take 
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Embedding,LSTM,RepeatVector
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import load_model
from tensorflow.keras import optimizers

import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth',20)

Data Gathering and Preparation

In [107]:
!!curl -O http://www.manythings.org/anki/fra-eng.zip
!!unzip fra-eng.zip

['Archive:  fra-eng.zip',
 'replace _about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n',
 'replace fra.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: n']

In [152]:
data_path = '/content/fra.txt'

with open(data_path,encoding='utf-8') as f:
  text = f.read()
text[:100]

'Go.\tVa !\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #1158250 (Wittydev)\nGo.\tMarche.'

In [109]:
# split the data into sequences

def to_lines(text):
  lines = text.strip().split('\n')
  sents = [line.split('\t')[:-1] for line in lines]
  return sents

In [110]:
eng_fra_text = to_lines(text)
eng_fra_text = array(eng_fra_text)  # converting into array
eng_fra_text[:10]

array([['Go.', 'Va !'],
       ['Go.', 'Marche.'],
       ['Go.', 'Bouge !'],
       ['Hi.', 'Salut !'],
       ['Hi.', 'Salut.'],
       ['Run!', 'Cours\u202f!'],
       ['Run!', 'Courez\u202f!'],
       ['Run!', 'Prenez vos jambes à vos cous !'],
       ['Run!', 'File !'],
       ['Run!', 'Filez !']], dtype='<U349')

In [111]:
eng_fra_text.shape

(190206, 2)

In [112]:
# we have around 1.9 million records of data 

Taking subset of the data

In [113]:
eng_fra_subset = eng_fra_text[:90000,:] # using first 90000 rows only to reduce the training time

Data Cleaning 

In [114]:
# remove puncutation

eng_fra_subset[:,0] = [s.translate(str.maketrans(" "," ",string.punctuation)) for s in eng_fra_subset[:,0]]
eng_fra_subset[:,1] = [s.translate(str.maketrans(" "," ",string.punctuation)) for s in eng_fra_subset[:,1]]

In [115]:
eng_fra_subset

array([['Go', 'Va '],
       ['Go', 'Marche'],
       ['Go', 'Bouge '],
       ...,
       ['Could you solve the problem',
        'Pourriezvous résoudre le problème\u202f'],
       ['Could you solve the problem',
        'Pourraistu résoudre le problème\u202f'],
       ['Could you speak more slowly',
        'Pouvezvous parler plus lentement\u202f']], dtype='<U349')

In [116]:
# convert text to lowercases
for i in range(len(eng_fra_subset)):
  eng_fra_subset[i,0] = eng_fra_subset[i,0].lower()
  eng_fra_subset[i,1] = eng_fra_subset[i,1].lower()
eng_fra_subset[:5]

array([['go', 'va '],
       ['go', 'marche'],
       ['go', 'bouge '],
       ['hi', 'salut '],
       ['hi', 'salut']], dtype='<U349')

Text to Sequences Conversion (word to Index Mapping)

In [117]:
# function to build a tokenizer (to build the vocabulary)
def tokenization(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer

In [118]:
# prepare english tokenizer
eng_tokenizer = tokenization(eng_fra_subset[:,0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1
print(f'English Vocabulary size {eng_vocab_size}')

# prepare french tokenizer
fre_tokenizer = tokenization(eng_fra_subset[:,1])
fre_vocab_size = len(fre_tokenizer.word_index) + 1
print(f'French Vocabulary size {fre_vocab_size}')

English Vocabulary size 8533
French Vocabulary size 20225


In [119]:
eng_len = fre_len = 8 # each sentence should be with lenght 8

In [120]:
# encode and pad sequences, padding to a max sentence length as mentiones as above

def encode_sequences(tokenizer,length,lines):
  # integer encode sequences
  seq = tokenizer.texts_to_sequences(lines)
  # pad sequences with 0 
  seq = pad_sequences(seq,maxlen=length,padding='post')
  return seq

We will encode English Setences as input and French setences as target sequences. This has to be done for both train and test datasets

In [121]:
from sklearn.model_selection import train_test_split

# X = eng_fra_subset[:,0]
# y = eng_fra_subset[:,1]

# splitting the data to train and test
train,test  = train_test_split(eng_fra_subset,test_size = 0.2,random_state = 42)

In [122]:
# prepare the training data
trainX = encode_sequences(eng_tokenizer,eng_len,train[:,0])
trainy = encode_sequences(fre_tokenizer,fre_len,train[:,1])

# prepare the testing data
testX = encode_sequences(eng_tokenizer,eng_len,test[:,0])
testy = encode_sequences(fre_tokenizer,fre_len,test[:,1])

Define our Seq2Seq model architecture:

In [123]:
# building the NMT model

def define_model(in_vocab,out_vocab,in_timesteps,out_timesteps,units):
  model = Sequential()
  # encoder
  model.add(Embedding(in_vocab,units,input_length=in_timesteps,mask_zero=True))
  model.add(LSTM(units))
  model.add(RepeatVector(out_timesteps))

  # decoder
  model.add(LSTM(units,return_sequences=True))
  model.add(Dense(out_vocab,activation='softmax'))
  
  return model

In [124]:
# we are using the RMSprop optimiser in this model as it is good choice when working with RNN networks.

# model compilation

model = define_model(eng_vocab_size,fre_vocab_size,eng_len,fre_len,512)
rms = optimizers.RMSprop(learning_rate =0.001)
model.compile(optimizer=rms,loss='sparse_categorical_crossentropy')

Training step

In [125]:
# params
epochs = 100
batch_size = 512
validation_split = 0.2

history = model.fit(trainX,trainy.reshape(trainy.shape[0],trainy.shape[1],1),
                    epochs = epochs, 
                    batch_size = batch_size,
                    validation_split = validation_split)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 24/100
Epoch 25/100
Epoch 26/100
Epoch 27/100
Epoch 28/100
Epoch 29/100
Epoch 30/100
Epoch 31/100
Epoch 32/100
Epoch 33/100
Epoch 34/100
Epoch 35/100
Epoch 36/100
Epoch 37/100
Epoch 38/100
Epoch 39/100
Epoch 40/100
Epoch 41/100
Epoch 42/100
Epoch 43/100
Epoch 44/100
Epoch 45/100
Epoch 46/100
Epoch 47/100
Epoch 48/100
Epoch 49/100
Epoch 50/100
Epoch 51/100
Epoch 52/100
Epoch 53/100
Epoch 54/100
Epoch 55/100
Epoch 56/100
Epoch 57/100
Epoch 58/100
Epoch 59/100
Epoch 60/100
Epoch 61/100
Epoch 62/100
Epoch 63/100
Epoch 64/100
Epoch 65/100
Epoch 66/100
Epoch 67/100
Epoch 68/100
Epoch 69/100
Epoch 70/100
Epoch 71/100
Epoch 72/100
Epoch 73/100
Epoch 74/100
Epoch 75/100
Epoch 76/100
Epoch 77/100
Epoch 78

In [154]:
model.save('/content/seq2seq_nmt.h5')

In [None]:
# loading the model
model = load_model('/content/seq2seq_nmt.h5')

In [127]:
# prediction

y_prob = model.predict(testX.reshape(testX.shape[0],testX.shape[1])[:20])

In [128]:
preds = y_prob.argmax(axis=-1)

In [143]:
preds = y_prob.argmax(axis=-1)

In [144]:
preds

array([[    5,    38,  2504,   185,     2,     0,     0,     0],
       [    5,  5641,     4,     0,     0,     0,     0,     0],
       [   89,    21,  4066,     0,     0,     0,     0,     0],
       [    1,     5,  1134,    11,   615,     0,     0,     0],
       [   46,  1324,   387,     0,     0,     0,     0,     0],
       [   18,  1081,     0,     0,     0,     0,     0,     0],
       [   13,    13,    30,    21,    21,    21,    21,     0],
       [  968,    56,    53,     0,     0,     0,     0,     0],
       [    4,    15,    55,   150,   150,     0,     0,     0],
       [    6,   115,     2,    25,     0,     0,     0,     0],
       [    1,    76,     2,     2,     0,     0,     0,     0],
       [    1,    76,    23,   215,    67,   138,     0,     0],
       [  108,     7,  2310,     0,     0,     0,     0,     0],
       [   72,  4513,  9978,     0,     0,     0,     0,     0],
       [10729,    67,     1,    23,  2011,     0,     0,     0],
       [   41,     8,   4

In [148]:
# these predictions are sequence of integers, 

def get_word(n,tokenizer):
  for token,index in tokenizer.word_index.items():
    # print(token)
    if index == n:
      return token
  return

In [149]:
get_word(432,fre_tokenizer)

'estelle'

In [150]:
preds_text = []

for i in range(preds.shape[0]):
  pred_= []
  for j in range(preds.shape[1]):
    val = preds[i,j]
    if val > 0 :
      word = get_word(preds[i,j],fre_tokenizer)
      if word != None:
        pred_.append(word)
  preds_text.append(" ".join(pred_))

In [174]:
eval_df = pd.DataFrame({'actual':test[:20,1],'predict':preds_text})
eval_df

Unnamed: 0,actual,predict
0,tu es inquiet pa...,vous êtes inquiè...
1,tu aimerais tom,vous aimeriez tom
2,avezvous une que...,astu une question
3,je te laisserai ...,je vous laissera...
4,pourquoi voudrie...,pourquoi voudrai...
5,jai promis,jai promis
6,nous nous réunis...,nous nous fait u...
7,vous y voici,ty y es
8,tom a été pris a...,tom a été par par
9,ne faites pas ça,ne fais pas ça


Evaluating the Model

##### Lets Evaluate our model on basis of metric BLEU (Bilingual Evaluation Understudy Score)

reference:  https://machinelearningmastery.com/calculate-bleu-score-for-text-python/

In [155]:
from nltk.translate.bleu_score import sentence_bleu

In [166]:
def get_bleu_score(reference,candidate):
  reference = [[word.strip() for word in reference.split()]]
  candidate = [word.strip() for word in candidate.split()]
  return sentence_bleu(reference,candidate)

In [187]:
# generate the bleu score

scores = []
for _,act,pred in eval_df.itertuples():
  score_ = sentence_bleu(act,pred)
  scores.append(score_)

eval_df['BLEU_score'] = scores

Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


In [188]:
eval_df.head()

Unnamed: 0,actual,predict,BLEU_score
0,tu es inquiet pa...,vous êtes inquiè...,0.784781
1,tu aimerais tom,vous aimeriez tom,0.875765
2,avezvous une que...,astu une question,0.884158
3,je te laisserai ...,je vous laissera...,0.818251
4,pourquoi voudrie...,pourquoi voudrai...,0.849891


In [15]:
print(f"On an Avearage, we are getting BLEU score of {array(scores).mean():.2f} for our Neural Machine Translation model")

On an Avearage, we are getting BLEU score of 0.83 for our Neural Machine Translation model


Approaches can imporove the model performance
- We are using only first 90K rows cause of long training time, Training on more data can improve the model
- using more sophisticated data cleaning approaches can also give more context
- Adjusting the model architecture (ie: Attention model) etc will improve the model
- Increasing the training time and adjusting the hyper parameters
