# Neural Machine Translation (NMT) - Translating English sentences to Vietnamese sentences

Neural machine translation refers to translating phrases across languages using deep learning, typically with recurrent neural networks ( RNNs ). Most production systems are complex, that is, they combine several algorithms. But, at its core, NMT uses sequence-to-sequence ( seq2seq ) RNN cells. Such models can work at the character level, but word-level models remain common.

![NMT system](https://3.bp.blogspot.com/-3Pbj_dvt0Vo/V-qe-Nl6P5I/AAAAAAAABQc/z0_6WtVWtvARtMk0i9_AtLeyyGyV6AI4wCLcB/s1600/nmt-model-fast.gif)

I strongly recommend switching to a GPU runtime so that training is faster.

## What are we going to do?
We will build an encoder-decoder LSTM model using the [Keras Functional API](https://www.tensorflow.org/alpha/guide/keras/functional) ( with [TensorFlow](https://www.tensorflow.org/) ). We will translate English sentences to Vietnamese. Why Vietnamese?


*   It uses many special characters ( diacritics ) and is considerably more complex to handle.


Here's an example,

Hello --> Xin chào

So, let's get started.



## Preparing the Data

### 1) Importing the libraries

We will import TensorFlow and Keras. From Keras, we import various modules which help in building NN layers, preprocessing data and constructing LSTM models.

In [3]:
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers , activations , models , preprocessing , utils
import pandas as pd
import json
import os
import pickle

In [2]:
from google.colab import drive
drive.mount('/content/drive')
path_to_save = "/content/drive/MyDrive/Colab Notebooks/LSTM"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 2) Reading the data


Our dataset contains more than 8,000 pairs of English-Vietnamese phrases. It is available at http://www.manythings.org/anki/, which also offers 50+ other sets of bilingual sentences. We download the English-Vietnamese dataset, unzip it and read it using [Pandas](https://pandas.pydata.org/).

In [3]:

!wget http://www.manythings.org/anki/vie-eng.zip -O vie-eng.zip
!unzip vie-eng.zip


--2022-06-22 06:27:33--  http://www.manythings.org/anki/vie-eng.zip
Resolving www.manythings.org (www.manythings.org)... 104.21.92.44, 172.67.186.54, 2606:4700:3033::ac43:ba36, ...
Connecting to www.manythings.org (www.manythings.org)|104.21.92.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 320614 (313K) [application/zip]
Saving to: ‘vie-eng.zip’


2022-06-22 06:27:33 (9.82 MB/s) - ‘vie-eng.zip’ saved [320614/320614]

Archive:  vie-eng.zip
replace _about.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: _about.txt              
replace vie.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: vie.txt                 


In [4]:
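# vie.txt has three tab-separated columns ( English, Vietnamese, attribution ). Because only two
# names are given, pandas uses the first column ( English ) as the index; the next cells fix this.
lines = pd.read_table( 'vie.txt' , names=[ 'eng' , 'vie' ] )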

In [5]:
lines.reset_index( level=0 , inplace=True )

In [6]:
lines

|      | index | eng | vie |
|------|-------|-----|-----|
| 0 | Run! | Chạy! | CC-BY 2.0 (France) Attribution: tatoeba.org #9... |
| 1 | Help! | Giúp tôi với! | CC-BY 2.0 (France) Attribution: tatoeba.org #4... |
| 2 | Go on. | Tiếp tục đi. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 3 | Hello! | Chào bạn. | CC-BY 2.0 (France) Attribution: tatoeba.org #3... |
| 4 | Hurry! | Nhanh lên nào! | CC-BY 2.0 (France) Attribution: tatoeba.org #1... |
| ... | ... | ... | ... |
| 8045 | In 2009, Selena Gomez became the youngest pers... | Vào năm 2009, Sê-lê-na Gô-mét đã được lựa chọn... | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |
| 8046 | In 2009, Selena Gomez became the youngest pers... | Vào năm 2009, Selena Gomez đã được lựa chọn để... | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |
| 8047 | In 2009, Selena Gomez became the youngest pers... | Vào năm 2009, Selena Gomez đã trở thành Đại sứ... | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |
| 8048 | The people here are particular about what they... | Những người ở đây khá là khó tính về khẩu vị ă... | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |


In [6]:
lines.rename( columns={ 'index' : 'eng' , 'eng' : 'vie' , 'vie' : 'c' } , inplace=True )

In [6]:
lines.head()

|      | eng | vie | c |
|------|-----|-----|---|
| 0 | Run! | Chạy! | CC-BY 2.0 (France) Attribution: tatoeba.org #9... |
| 1 | Help! | Giúp tôi với! | CC-BY 2.0 (France) Attribution: tatoeba.org #4... |
| 2 | Go on. | Tiếp tục đi. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 3 | Hello! | Chào bạn. | CC-BY 2.0 (France) Attribution: tatoeba.org #3... |
| 4 | Hurry! | Nhanh lên nào! | CC-BY 2.0 (France) Attribution: tatoeba.org #1... |


### 3) Preparing input data for the Encoder ( `encoder_input_data` )
The Encoder model will be fed the preprocessed English sentences. The preprocessing is done as follows :


1.   Tokenizing the English sentences from `eng_lines`.
2.   Determining the maximum length of the tokenized English sentences, which is `max_input_length`.
3.   Padding the `tokenized_eng_lines` to `max_input_length`.
4.   Determining the vocabulary size ( `num_eng_tokens` ) for English words.
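
To make the four steps above concrete, here is a minimal pure-Python sketch on a toy corpus. It only mimics the general behaviour of `Tokenizer` and `pad_sequences` ( frequency-ranked word indices starting at 1, zeros appended as padding ). The helper names `build_word_index` and `pad_post` and the toy sentences are hypothetical and are not part of Keras or of the dataset.

```python
from collections import Counter

def build_word_index(sentences):
    """Assign indices 1..N to words, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Sort by descending frequency; sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i + 1 for i, word in enumerate(ranked)}

def pad_post(seqs, maxlen):
    """Append zeros so that every sequence has exactly maxlen entries."""
    return [seq + [0] * (maxlen - len(seq)) for seq in seqs]

toy_lines = ["i like tea", "i like green tea", "go"]

# Step 1: tokenize
word_index = build_word_index(toy_lines)
token_seqs = [[word_index[w] for w in s.lower().split()] for s in toy_lines]

# Step 2: maximum sequence length
max_len = max(len(seq) for seq in token_seqs)

# Step 3: pad every sequence to that length
padded = pad_post(token_seqs, max_len)

# Step 4: vocabulary size, +1 for the padding index 0
vocab_size = len(word_index) + 1

print(padded)
print(vocab_size)
```

The `+1` in the vocabulary size is the same adjustment the notebook makes with `len( eng_word_dict )+1`: index 0 never appears in the word dictionary, but the Embedding layer still needs a row for it.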





In [7]:
eng_lines = list()
for line in lines.eng:
    eng_lines.append( line ) 

In [8]:
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( eng_lines ) 
tokenized_eng_lines = tokenizer.texts_to_sequences( eng_lines ) 

In [9]:
length_list = list()
for token_seq in tokenized_eng_lines:
    length_list.append( len( token_seq ))
max_input_length = np.array( length_list ).max()
print( 'English max length is {}'.format( max_input_length ))

English max length is 32


In [10]:
padded_eng_lines = preprocessing.sequence.pad_sequences( tokenized_eng_lines , maxlen=max_input_length , padding='post' )
encoder_input_data = np.array( padded_eng_lines )
print( 'Encoder input data shape -> {}'.format( encoder_input_data.shape ))

Encoder input data shape -> (8050, 32)


In [14]:
encoder_input_data[18]

array([   9, 1350,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0],
      dtype=int32)

In [13]:
eng_word_dict = tokenizer.word_index
num_eng_tokens = len( eng_word_dict )+1
print( 'Number of English tokens = {}'.format( num_eng_tokens))

Number of English tokens = 3802


In [16]:
eng_word_dict

{'i': 1,
 'to': 2,
 'the': 3,
 'tom': 4,
 'you': 5,
 'a': 6,
 'is': 7,
 'that': 8,
 'he': 9,
 'do': 10,
 'in': 11,
 'of': 12,
 'me': 13,
 'this': 14,
 'have': 15,
 "don't": 16,
 'it': 17,
 'was': 18,
 'my': 19,
 'for': 20,
 "i'm": 21,
 'are': 22,
 'mary': 23,
 'your': 24,
 'we': 25,
 'she': 26,
 'what': 27,
 'want': 28,
 'be': 29,
 'at': 30,
 'with': 31,
 'like': 32,
 'on': 33,
 'his': 34,
 'think': 35,
 'know': 36,
 'not': 37,
 'and': 38,
 'can': 39,
 'has': 40,
 'did': 41,
 'go': 42,
 'very': 43,
 'will': 44,
 'how': 45,
 'there': 46,
 "didn't": 47,
 'going': 48,
 'here': 49,
 'time': 50,
 'get': 51,
 "it's": 52,
 'all': 53,
 'up': 54,
 'no': 55,
 "can't": 56,
 'an': 57,
 'as': 58,
 'had': 59,
 'about': 60,
 'him': 61,
 'one': 62,
 'from': 63,
 'why': 64,
 'if': 65,
 'when': 66,
 'they': 67,
 'but': 68,
 'out': 69,
 'more': 70,
 'her': 71,
 'said': 72,
 'who': 73,
 'by': 74,
 "i'll": 75,
 'come': 76,
 'need': 77,
 'than': 78,
 'would': 79,
 'never': 80,
 "isn't": 81,
 'home': 82,
 'r

In [14]:
with open(os.path.join(path_to_save,'eng_word_dict.pkl'), 'wb') as fp:
    pickle.dump(eng_word_dict, fp)

### 4) Preparing input data for the Decoder ( `decoder_input_data` )
The Decoder model will be fed the preprocessed Vietnamese lines. The preprocessing steps are the same as the ones above, with one extra step carried out first :


*   Prepend the `<START>` tag to each Vietnamese sentence.
*   Append the `<END>` tag to each Vietnamese sentence.
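
One detail is easy to miss: with its default settings the Keras `Tokenizer` lowercases the text and filters out punctuation, including `<` and `>`. The tags therefore end up in the vocabulary as plain `start` and `end`, which is why the inference code later looks up `vie_word_dict['start']`. The sketch below imitates that preprocessing in pure Python so it runs without TensorFlow; `simple_word_sequence` is a hypothetical helper, and the filter string is copied from the Tokenizer's documented default.

```python
# Default `filters` value of the Keras Tokenizer, copied here so the sketch needs no TensorFlow
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_word_sequence(text, filters=KERAS_DEFAULT_FILTERS):
    """Lowercase, replace every filtered character with a space, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in filters})
    return text.lower().translate(table).split()

tagged = "<START> Xin chào <END>"
words = simple_word_sequence(tagged)
print(words)
```

If the brackets had to be preserved, one option would be to pass a custom `filters` string to the `Tokenizer`; the notebook does not do this and simply relies on `start` and `end` as the tag tokens.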





In [15]:
vie_lines = list()
for line in lines.vie:
    vie_lines.append( '<START> ' + line + ' <END>' )  

In [16]:
tokenizer = preprocessing.text.Tokenizer()
tokenizer.fit_on_texts( vie_lines ) 
tokenized_vie_lines = tokenizer.texts_to_sequences( vie_lines ) 

In [17]:
length_list = list()
for token_seq in tokenized_vie_lines:
    length_list.append( len( token_seq ))
max_output_length = np.array( length_list ).max()
print( 'Vietnam max length is {}'.format( max_output_length ))

Vietnam max length is 43


In [18]:
padded_vie_lines = preprocessing.sequence.pad_sequences( tokenized_vie_lines , maxlen=max_output_length, padding='post' )
decoder_input_data = np.array( padded_vie_lines )
print( 'Decoder input data shape -> {}'.format( decoder_input_data.shape ))

Decoder input data shape -> (8050, 43)


In [19]:
vie_word_dict = tokenizer.word_index
num_vie_tokens = len( vie_word_dict )+1
print( 'Number of Vietnam tokens = {}'.format( num_vie_tokens))

Number of Vietnam tokens = 2384


In [20]:
with open(os.path.join(path_to_save,'vie_word_dict.pkl'), 'wb') as fp:
    pickle.dump(vie_word_dict, fp)

### 5) Preparing target data for the Decoder ( `decoder_target_data` )

We take a copy of `tokenized_vie_lines` and modify it like this.



1.   We remove the `<start>` tag which we prepended earlier. The first token of every sequence ( which is `<start>` ) is dropped, so the target sequence is shifted one step ahead of the decoder input.
2.   Pad the shortened sequences and convert the resulting `padded_vie_lines` ( which do not have the `<start>` tag ) to one-hot vectors.

For example :

```
 [ '<start>' , 'hello' , 'world' , '<end>' ]

```

will become

```
 [ 'hello' , 'world' , '<end>' ]

```
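
The shift and the one-hot encoding can be sketched on a single toy sequence. The token indices below are hypothetical ( 0 = padding, 1 = `start`, 2 = `hello`, 3 = `world`, 4 = `end` ), and the one-hot step is written in plain Python as a stand-in for `utils.to_categorical`.

```python
# Hypothetical toy vocabulary: 0 = padding, 1 = start, 2 = hello, 3 = world, 4 = end
decoder_input = [1, 2, 3, 4]
num_tokens = 5
max_len = 4

# Drop the leading start token, so position t holds the word the decoder should predict at step t
target = decoder_input[1:]

# Post-pad with zeros back to the common length
target_padded = target + [0] * (max_len - len(target))

# One-hot encode: row t has a single 1.0 in the column of the target token
one_hot = [[1.0 if col == token else 0.0 for col in range(num_tokens)]
           for token in target_padded]

for dec_in, row in zip(decoder_input, one_hot):
    print(dec_in, "->", row)
```

Each printed pair shows the token the decoder receives at a step and the one-hot row it is trained to output at that same step.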


In [21]:

decoder_target_data = list()
for token_seq in tokenized_vie_lines:
    decoder_target_data.append( token_seq[ 1 : ] ) 
    
padded_vie_lines = preprocessing.sequence.pad_sequences( decoder_target_data , maxlen=max_output_length, padding='post' )
onehot_vie_lines = utils.to_categorical( padded_vie_lines , num_vie_tokens )
decoder_target_data = np.array( onehot_vie_lines )
print( 'Decoder target data shape -> {}'.format( decoder_target_data.shape ))


Decoder target data shape -> (8050, 43, 2384)
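
A one-hot tensor of this shape is large. The back-of-the-envelope estimate below assumes 4 bytes per entry ( `to_categorical` returns `float32` by default ) and compares it with keeping the targets as `int32` token indices. A common alternative, which this notebook does not use, is to feed integer targets together with the `sparse_categorical_crossentropy` loss.

```python
# Shape printed above: (sentences, time steps, Vietnamese vocabulary size)
num_sentences, max_output_length, num_vie_tokens = 8050, 43, 2384
bytes_per_value = 4  # float32 for one-hot entries, int32 for token indices

one_hot_bytes = num_sentences * max_output_length * num_vie_tokens * bytes_per_value
index_bytes = num_sentences * max_output_length * bytes_per_value

print(f"one-hot targets: {one_hot_bytes / 1e9:.2f} GB")
print(f"integer targets: {index_bytes / 1e6:.2f} MB")
```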


## Defining and Training the models

### 1) Defining the Encoder-Decoder model
The model will have Embedding, LSTM and Dense layers. The basic configuration is as follows.


*   2 Input Layers : One for `encoder_input_data` and another for `decoder_input_data`.
*   Embedding layer : For converting token indices to fixed-size dense vectors. **( Note :  Don't forget the `mask_zero=True` argument here )**
*   LSTM layer : Provides access to Long Short-Term Memory cells.
*   Dense layer : Applies a softmax over the Vietnamese vocabulary at every time step.

Working : 

1.   The `encoder_input_data` goes into the Embedding layer (  `encoder_embedding` ). 
2.   The output of the Embedding layer goes to the LSTM cell, which produces 2 state vectors ( `h` and `c`, which together are `encoder_states` ).
3.   These states are set as the initial state of the decoder's LSTM cell.
4.   The `decoder_input_data` comes in through its own Embedding layer.
5.   The embeddings go into the decoder's LSTM cell ( which holds the encoder states ) to produce the output sequences.









In [22]:

encoder_inputs = tf.keras.layers.Input(shape=( None , ), name = 'enc_input')
encoder_embedding = tf.keras.layers.Embedding( num_eng_tokens, 256 , mask_zero=True,name = 'enc_embedding') (encoder_inputs)
encoder_outputs , state_h , state_c = tf.keras.layers.LSTM( 128 , return_state=True,name = 'enc_output'  )( encoder_embedding )
encoder_states = [ state_h , state_c ]
decoder_inputs = tf.keras.layers.Input(shape=( None ,  ), name = 'dec_input')
decoder_embedding = tf.keras.layers.Embedding( num_vie_tokens, 256 , mask_zero=True,name = 'dec_embedding') (decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM( 128 , return_state=True , return_sequences=True, name = 'decoder_lstm')
decoder_outputs , _ , _ = decoder_lstm ( decoder_embedding , initial_state=encoder_states )
decoder_dense = tf.keras.layers.Dense( num_vie_tokens , activation=tf.keras.activations.softmax, name = 'decoder_dense' ) 
output = decoder_dense ( decoder_outputs )

model = tf.keras.models.Model([encoder_inputs, decoder_inputs], output )
model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss='categorical_crossentropy')

model.summary()


Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 enc_input (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 dec_input (InputLayer)         [(None, None)]       0           []                               
                                                                                                  
 enc_embedding (Embedding)      (None, None, 256)    973312      ['enc_input[0][0]']              
                                                                                                  
 dec_embedding (Embedding)      (None, None, 256)    610304      ['dec_input[0][0]']              
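
The two embedding parameter counts above can be checked by hand: an Embedding layer stores one vector per token, so it has `vocab_size * embedding_dim` weights. The sketch below reproduces those two numbers and also applies the standard parameter formulas for an LSTM layer and a Dense layer to the sizes used in this model. The LSTM and Dense results are what the formulas give, not values read from the summary, and the three helper functions are hypothetical names.

```python
num_eng_tokens, num_vie_tokens = 3802, 2384  # vocabulary sizes printed earlier
embedding_dim, lstm_units = 256, 128         # sizes used in the model definition

def embedding_params(vocab_size, dim):
    # One dense vector of length `dim` per token index
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with an input kernel, a recurrent kernel and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

print("enc_embedding:", embedding_params(num_eng_tokens, embedding_dim))
print("dec_embedding:", embedding_params(num_vie_tokens, embedding_dim))
print("each LSTM    :", lstm_params(embedding_dim, lstm_units))
print("decoder_dense:", dense_params(lstm_units, num_vie_tokens))
```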
                                                                                              

### 2) Training the model
We train the model for 100 epochs with a batch size of 250, using the RMSprop optimizer and the categorical cross-entropy loss function.
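
The `33/33` that appears in the training log is the number of batches per epoch. As a quick check, assuming the 8050 training pairs from the shapes printed earlier and the batch size of 250 used in the `fit` call:

```python
import math

num_samples = 8050  # rows in encoder_input_data
batch_size = 250    # value passed to model.fit

# The last batch is smaller than the others, so round up
steps_per_epoch = math.ceil(num_samples / batch_size)
print(steps_per_epoch)
```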

In [23]:
# model.fit([encoder_input_data , decoder_input_data], decoder_target_data, batch_size=200, epochs=10) 
model.fit([encoder_input_data , decoder_input_data], decoder_target_data, batch_size=250, epochs=100, verbose=2)

Epoch 1/100
33/33 - 41s - loss: 1.4244 - 41s/epoch - 1s/step
Epoch 2/100
33/33 - 34s - loss: 1.2562 - 34s/epoch - 1s/step
Epoch 3/100
33/33 - 33s - loss: 1.2040 - 33s/epoch - 1s/step
Epoch 4/100
33/33 - 33s - loss: 1.1637 - 33s/epoch - 1s/step
Epoch 5/100
33/33 - 35s - loss: 1.1324 - 35s/epoch - 1s/step
Epoch 6/100
33/33 - 35s - loss: 1.1048 - 35s/epoch - 1s/step
Epoch 7/100
33/33 - 33s - loss: 1.0782 - 33s/epoch - 1s/step
Epoch 8/100
33/33 - 33s - loss: 1.0505 - 33s/epoch - 1s/step
Epoch 9/100
33/33 - 33s - loss: 1.0227 - 33s/epoch - 1s/step
Epoch 10/100
33/33 - 33s - loss: 0.9967 - 33s/epoch - 1s/step
Epoch 11/100
33/33 - 33s - loss: 0.9729 - 33s/epoch - 1s/step
Epoch 12/100
33/33 - 33s - loss: 0.9502 - 33s/epoch - 1s/step
Epoch 13/100
33/33 - 36s - loss: 0.9301 - 36s/epoch - 1s/step
Epoch 14/100
33/33 - 33s - loss: 0.9108 - 33s/epoch - 1s/step
Epoch 15/100
33/33 - 35s - loss: 0.8926 - 35s/epoch - 1s/step
Epoch 16/100
33/33 - 33s - loss: 0.8758 - 33s/epoch - 1s/step
Epoch 17/100
33/3

<keras.callbacks.History at 0x7f1595ed0fd0>

In [24]:
model.save(os.path.join(path_to_save,'saved_model'))



INFO:tensorflow:Assets written to: /content/drive/MyDrive/Colab Notebooks/LSTM/saved_model/assets




## Inferencing on the models

In [17]:
import tensorflow as tf
from tensorflow.keras import layers , activations , models , preprocessing , utils
import numpy as np
import json
import os
import pickle
# Change this path to the folder that holds the saved dictionaries and model
path = r"LSTM"
# On Colab, mount Google Drive first:
# from google.colab import drive
# drive.mount('/content/drive')

with open(os.path.join(path,'eng_word_dict.pkl'), 'rb') as fp:
    eng_word_dict = pickle.load(fp)
with open(os.path.join(path,'vie_word_dict.pkl'), 'rb') as fp:
    vie_word_dict = pickle.load(fp)
num_eng_tokens = len(eng_word_dict) + 1
num_vie_tokens = len(vie_word_dict) + 1

model = tf.keras.models.load_model(os.path.join(path,'saved_model'))

2022-06-23 10:32:52.504607: W tensorflow/core/common_runtime/graph_constructor.cc:805] Node 'cond/while' has 13 outputs but the _output_shapes attribute specifies shapes for 46 outputs. Output shapes may be inaccurate.
2022-06-23 10:32:52.705832: W tensorflow/core/common_runtime/graph_constructor.cc:805] Node 'cond' has 5 outputs but the _output_shapes attribute specifies shapes for 46 outputs. Output shapes may be inaccurate.
2022-06-23 10:32:53.221574: W tensorflow/core/common_runtime/graph_constructor.cc:805] Node 'cond/while' has 13 outputs but the _output_shapes attribute specifies shapes for 46 outputs. Output shapes may be inaccurate.
2022-06-23 10:32:53.504454: W tensorflow/core/common_runtime/graph_constructor.cc:805] Node 'cond/while' has 13 outputs but the _output_shapes attribute specifies shapes for 46 outputs. Output shapes may be inaccurate.
2022-06-23 10:32:53.518852: W tensorflow/core/common_runtime/graph_constructor.cc:805] Node 'cond' has 5 outputs but the _output_sh

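### 1) Defining the inference models

We reuse the trained layers to build two separate models. The encoder model maps an input sentence to its final LSTM states. The decoder model takes one token plus the previous states and returns the next-token probabilities together with the updated states.

In [18]: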

def make_inference_models():
    # Encoder model: maps an input sequence to the final LSTM states ( h and c ).
    encoder_inputs = model.get_layer("enc_input").output
    encoder_outputs, state_h , state_c = model.get_layer("enc_output").output
    encoder_states = [ state_h , state_c ]
    encoder_model = tf.keras.models.Model(encoder_inputs, encoder_states)

    # Decoder model: the states become explicit inputs so they can be fed back at every step.
    # 128 must match the number of units of the trained LSTM layers.
    decoder_state_input_h = tf.keras.layers.Input(shape=( 128 ,))
    decoder_state_input_c = tf.keras.layers.Input(shape=( 128 ,))

    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

    # Reuse the trained embedding, LSTM and Dense layers of the decoder.
    decoder_embedding = model.get_layer("dec_embedding")
    decoder_inputs = model.get_layer("dec_input")

    decoder_outputs, state_h, state_c = model.get_layer("decoder_lstm")(
        decoder_embedding.output , initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = model.get_layer("decoder_dense")(decoder_outputs)
    decoder_model = tf.keras.models.Model(
        [decoder_inputs.output] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

    return encoder_model , decoder_model

max_input_length = 32
max_output_length = 43
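def str_to_tokens( sentence : str ):
    # Apply the same preprocessing as the Tokenizer: lowercase and strip punctuation.
    words = preprocessing.text.text_to_word_sequence( sentence )
    tokens_list = list()
    for word in words:
        # Skip words that were not seen during training instead of raising a KeyError.
        if word in eng_word_dict:
            tokens_list.append( eng_word_dict[ word ] )
    return preprocessing.sequence.pad_sequences( [tokens_list] , maxlen=max_input_length , padding='post')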


### 2) Making some translations


1.   First, we take an English sentence and predict the state values using `enc_model`.
2.   We set the state values in the decoder's LSTM.
3.   Then, we generate a sequence which contains only the `<start>` element.
4.   We input this sequence into the `dec_model`.
5.   We replace the `<start>` element with the element which was predicted by the `dec_model` and update the state values.
6.   We carry out the above steps iteratively until we hit the `<end>` tag or the maximum sequence length.
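
The loop described in these steps is greedy decoding: at every step the most probable token is chosen and fed back in. Here is a minimal sketch of the control flow only, assuming a stand-in `toy_decode_step` function in place of `dec_model.predict`. The names `greedy_decode`, `toy_decode_step` and `index_to_word`, as well as the token indices, are hypothetical.

```python
def greedy_decode(decode_step, initial_state, start_id, end_id, max_len):
    """Feed the previous prediction back in until end_id or max_len is reached."""
    token, state, output = start_id, initial_state, []
    while len(output) < max_len:
        probs, state = decode_step(token, state)
        # Greedy choice: the index with the highest probability
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == end_id:
            break
        output.append(token)
    return output

def toy_decode_step(token, state):
    """Stand-in for the decoder: always predicts 'previous token + 1' with probability 1."""
    vocab_size = 6
    probs = [0.0] * vocab_size
    probs[(token + 1) % vocab_size] = 1.0
    return probs, state

# Hypothetical indices: 1 = start, 2 = xin, 3 = chào, 4 = end
index_to_word = {2: "xin", 3: "chào"}
ids = greedy_decode(toy_decode_step, initial_state=None, start_id=1, end_id=4, max_len=10)
translation = " ".join(index_to_word[i] for i in ids)
print(translation)
```

Keeping an index-to-word dictionary avoids scanning the whole vocabulary for every predicted token, and checking for the end token before appending means nothing has to be trimmed from the result afterwards.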







In [20]:
import speech_recognition as sr
from gtts import gTTS
import playsound
import os

# Initialize the recognizer
r = sr.Recognizer()

# Build the inference models once, before the loop
enc_model , dec_model = make_inference_models()

# Reverse lookup table: token index -> Vietnamese word
vie_index_to_word = { index : word for word , index in vie_word_dict.items() }

# Keep listening until the cell is interrupted
while True:
    try:
        # Use the microphone as the input source
        with sr.Microphone() as source2:

            # Let the recognizer spend 0.2 s adjusting its energy
            # threshold to the surrounding noise level
            r.adjust_for_ambient_noise(source2, duration=0.2)

            # Listen for the user's input
            audio2 = r.listen(source2)

            # Transcribe the audio with Google's speech recognition
            MyText = r.recognize_google(audio2)
            MyText = MyText.lower()

            print("Did you say: "+ MyText)

            states_values = enc_model.predict( str_to_tokens( MyText ) )
            empty_target_seq = np.zeros( ( 1 , 1 ) )
            empty_target_seq[0, 0] = vie_word_dict['start']
            decoded_translation = ''
            num_words = 0
            while num_words < max_output_length :
                dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
                sampled_word_index = int( np.argmax( dec_outputs[0, -1, :] ) )
                sampled_word = vie_index_to_word.get( sampled_word_index )

                # Stop at the end tag, or if the padding index ( no word ) is predicted
                if sampled_word is None or sampled_word == 'end':
                    break
                decoded_translation += ' {}'.format( sampled_word )
                num_words += 1

                # Feed the predicted token and the updated states back into the decoder
                empty_target_seq = np.zeros( ( 1 , 1 ) )
                empty_target_seq[ 0 , 0 ] = sampled_word_index
                states_values = [ h , c ]

            print( decoded_translation )

            # Speak the translation; skip synthesis when nothing was decoded
            if decoded_translation.strip():
                output = gTTS(decoded_translation, lang="vi", slow=False)
                output.save("output.mp3")
                playsound.playsound('output.mp3', True)

    except sr.RequestError as e:
        print("Could not request results; {0}".format(e))

    except sr.UnknownValueError:
        print("unknown error occurred")

unknown error occurred


KeyboardInterrupt: 

In [7]:

enc_model , dec_model = make_inference_models()

# Reverse lookup table: token index -> Vietnamese word
vie_index_to_word = { index : word for word , index in vie_word_dict.items() }

while True:
    states_values = enc_model.predict( str_to_tokens( input( 'Enter eng sentence : ' ) ) )
    empty_target_seq = np.zeros( ( 1 , 1 ) )
    empty_target_seq[0, 0] = vie_word_dict['start']
    decoded_translation = ''
    num_words = 0
    while num_words < max_output_length :
        dec_outputs , h , c = dec_model.predict([ empty_target_seq ] + states_values )
        sampled_word_index = int( np.argmax( dec_outputs[0, -1, :] ) )
        sampled_word = vie_index_to_word.get( sampled_word_index )

        # Stop at the end tag, or if the padding index ( no word ) is predicted
        if sampled_word is None or sampled_word == 'end':
            break
        decoded_translation += ' {}'.format( sampled_word )
        num_words += 1

        # Feed the predicted token and the updated states back into the decoder
        empty_target_seq = np.zeros( ( 1 , 1 ) )
        empty_target_seq[ 0 , 0 ] = sampled_word_index
        states_values = [ h , c ]

    print( decoded_translation )


 chào bạn
 bạn là cái nào thế
 chúng tôi biết
 chúng tôi biết
 hãy nói thật đi
