# Neural Machine Translation using attention mechanism

##### What is Machine Translation?

- Machine Translation is the task of automatically converting one natural language into another, preserving the meaning of the input text, and producing fluent text in the output language.

    There are four types of machine translation:
        - Statistical Machine Translation or SMT.
        - Rule-based Machine Translation or RBMT.
        - Hybrid Machine Translation or HMT.
        - Neural Machine Translation or NMT.

    In this project, we implemented Neural Machine Translation with the Bahdanau attention mechanism to translate English text to Telugu.

##### What is NMT?

- Neural machine translation, NMT for short, is the use of neural network models to learn a statistical model for machine translation.

- The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation.

#### Application areas of NMT:
- The graphic below shows how Google Translate handles translations from Spanish, Chinese, and French to English and vice versa, with accuracy levels that differ by language pair, rated on the same scale as human translators.

![Analysis](https://github.com/pragathi1234/Machine_Translation/blob/main/images/google_analysis.png)

A graphic from Google Research highlighting the 2016 accuracy levels of Google Translate.

- As their accuracy rises, machine translation systems are being adopted in more areas of business, enabling new applications and improved machine-learning models.
- Many applications now rely on translators. Our model can be integrated into any end-user application to help users understand English text in Telugu.
- Application areas include:
    - Agriculture.
    - Healthcare.
    - Finance.
    - Software and Technology.
    - Ecommerce.


#### About Data:
- We used two datasets for modeling. The data files were collected from the Google Translate API and [manythings](http://www.manythings.org/anki/).
    - The Google Translate API data consists of 2 columns and 5615 rows.
    - The [Manythings](http://www.manythings.org/anki/) data consists of 2 columns and 88370 rows.
    - The shape of the final data is (93985 x 2).
    
#### Tasks Performed.
- Loaded the text file, which contains two columns: English as the source and Telugu as the target.
- Preprocessed the data until it was a good fit for modeling.
- Tokenized the raw text to break it into small chunks.
- Split the data into train and test sets for modeling and evaluation.
- Modeling:
    - Implemented Encoder-Decoder with Attention.
    - Applied Predefined Hugging Face Models:
        - facebook/mbart-large-50-one-to-many-mmt.
        - Helsinki-NLP/opus-mt-en-dra.
        - Helsinki-NLP/opus-mt-en-mul.
- Evaluated all the models, compared their output with the Google Translate API, and computed the BLEU score for each model.



### Importing the required libraries.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import nltk.translate.bleu_score as bleu
import random
import string
from sklearn.model_selection import train_test_split
import os
import time
import re

#### Reading the text file which contains English and Telugu text sentences.

In [5]:
english_sentences = []
telugu_sentences = []

# Reading the text file (UTF-8, so that the Telugu text is decoded correctly).
with open("english_telugu_data.txt", mode='rt', encoding='utf-8') as fp:
    for line in fp.readlines():
        eng_tel = line.split("++++$++++")
        english_sentences.append(eng_tel[0])
        telugu_sentences.append(eng_tel[1])
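
The loop above assumes that every line contains the `++++$++++` separator exactly once. A minimal sketch of a more defensive variant is shown below. The helper name `parse_pairs` and the sample lines are hypothetical, and unlike the cell above it also strips the trailing newline.

```python
SEPARATOR = "++++$++++"  # delimiter used in english_telugu_data.txt

def parse_pairs(lines, sep=SEPARATOR):
    """Return (english, telugu) lists, skipping lines without exactly one separator."""
    english, telugu = [], []
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) != 2:
            continue  # malformed line: no separator, or more than one
        english.append(parts[0])
        telugu.append(parts[1])
    return english, telugu

# Hypothetical sample: one well-formed line and one malformed line.
sample = ["His legs are long.++++$++++అతని కాళ్ళు పొడవుగా ఉన్నాయి.\n", "broken line\n"]
eng, tel = parse_pairs(sample)
print(eng, tel)
```

Skipping malformed lines avoids an `IndexError` on `eng_tel[1]` when a line has no separator.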

In [6]:
# Displaying the information of text file.
df1=pd.DataFrame({"eng":english_sentences,"tel":telugu_sentences})
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155798 entries, 0 to 155797
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   eng     155798 non-null  object
 1   tel     155798 non-null  object
dtypes: object(2)
memory usage: 2.4+ MB


In [7]:
df1.head()

|   | eng | tel |
|---|-----|-----|
| 0 | His legs are long. | అతని కాళ్ళు పొడవుగా ఉన్నాయి.\n |
| 1 | Who taught Tom how to speak French? | టామ్ ఫ్రెంచ్ మాట్లాడటం ఎలా నేర్పించారు?\n |
| 2 | I swim in the sea every day. | నేను ప్రతి రోజు సముద్రంలో ఈత కొడతాను.\n |
| 3 | Tom popped into the supermarket on his way hom... | టామ్ కొంచెం పాలు కొనడానికి ఇంటికి వెళ్ళేటప్పుడ... |
| 4 | Smoke filled the room. | పొగ గదిని నింపింది.\n |


#### Reading the Google Translate API data from CSV

In [8]:
df2=pd.read_csv("/content/eng_tel.csv")
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5615 entries, 0 to 5614
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   English  5615 non-null   object
 1   Telugu   5615 non-null   object
dtypes: object(2)
memory usage: 87.9+ KB


In [9]:
df2.head(5)

|   | English | Telugu |
|---|---------|--------|
| 0 | politicians do not have permission to do what ... | రాజకీయ నాయకులకు చేయవలసినది చేయడానికి అనుమతి లేదు. |
| 1 | I'd like to tell you about one such child, | అలాంటి ఒక పిల్లల గురించి నేను మీకు చెప్పాలనుకు... |
| 2 | This percentage is even greater than the perce... | ఈ శాతం భారతదేశంలో ఉన్న శాతం కంటే ఎక్కువ. |
| 3 | what we really mean is that they're bad at not... | మేము నిజంగా అర్థం ఏమిటంటే వారు శ్రద్ధ చూపకపోవడ... |
| 4 | .The ending portion of these Vedas is called U... | .ఈ వేదాల ముగింపు భాగాన్ని ఉపనిషత్తు అంటారు. |


In [10]:
# Renaming the columns for easy use.
df2.rename(columns = {"English":"eng","Telugu":"tel"},inplace=True)

#### Concatenate two dataframes

In [11]:
new_df=pd.concat([df1, df2])
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 161413 entries, 0 to 5614
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   eng     161413 non-null  object
 1   tel     161413 non-null  object
dtypes: object(2)
memory usage: 3.7+ MB


In [12]:
new_df.head()

|   | eng | tel |
|---|-----|-----|
| 0 | His legs are long. | అతని కాళ్ళు పొడవుగా ఉన్నాయి.\n |
| 1 | Who taught Tom how to speak French? | టామ్ ఫ్రెంచ్ మాట్లాడటం ఎలా నేర్పించారు?\n |
| 2 | I swim in the sea every day. | నేను ప్రతి రోజు సముద్రంలో ఈత కొడతాను.\n |
| 3 | Tom popped into the supermarket on his way hom... | టామ్ కొంచెం పాలు కొనడానికి ఇంటికి వెళ్ళేటప్పుడ... |
| 4 | Smoke filled the room. | పొగ గదిని నింపింది.\n |


### Preprocessing

In [13]:
exclude = set(string.punctuation) 
remove_digits = str.maketrans('', '', string.digits) 

# Cleaning the English text column.
def preprocess_eng(text):
    text = text.lower() 
    text = re.sub("'", '', text) 
    text = ''.join(ch for ch in text if ch not in exclude)
    text = text.translate(remove_digits) 
    text = text.strip()
    text = re.sub(" +", " ", text) 
    text = '<start> ' + text + ' <end>'
    return text

# Cleaning the Telugu text column. 
def preprocess_tel(text):
    text = re.sub("'", '', text) 
    text = ''.join(ch for ch in text if ch not in exclude)
    text = text.strip()
    text = re.sub(" +", " ", text) 
    text = '<start> ' + text + ' <end>'
    return text
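
As a worked example of the English cleaning steps, the sketch below runs a made-up sentence through a condensed equivalent of `preprocess_eng`. The name `clean_english` and the sample sentence are illustrative only; the function is repeated so that the snippet runs on its own.

```python
import re
import string

PUNCTUATION = set(string.punctuation)
DIGITS = str.maketrans('', '', string.digits)

def clean_english(text):
    """Condensed equivalent of preprocess_eng: lowercase, drop punctuation and digits, add tags."""
    text = text.lower().replace("'", "")
    text = ''.join(ch for ch in text if ch not in PUNCTUATION)
    text = text.translate(DIGITS).strip()
    text = re.sub(" +", " ", text)  # collapse the gaps left by removed characters
    return '<start> ' + text + ' <end>'

print(clean_english("Tom's 2 dogs ran!"))  # <start> toms dogs ran <end>
```

Note that `string.punctuation` covers ASCII punctuation only, so any non-ASCII punctuation in the Telugu column would pass through `preprocess_tel` unchanged.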

In [14]:
# Preprocessing both columns and adding start and end tags to each sentence.
new_df['eng'] = new_df['eng'].apply(preprocess_eng)
new_df['tel'] = new_df['tel'].apply(preprocess_tel)
new_df.head(10)

|   | eng | tel |
|---|-----|-----|
| 0 | `<start> his legs are long <end>` | `<start> అతని కాళ్ళు పొడవుగా ఉన్నాయి <end>` |
| 1 | `<start> who taught tom how to speak french <end>` | `<start> టామ్ ఫ్రెంచ్ మాట్లాడటం ఎలా నేర్పించారు...` |
| 2 | `<start> i swim in the sea every day <end>` | `<start> నేను ప్రతి రోజు సముద్రంలో ఈత కొడతాను <...` |
| 3 | `<start> tom popped into the supermarket on his...` | `<start> టామ్ కొంచెం పాలు కొనడానికి ఇంటికి వెళ్...` |
| 4 | `<start> smoke filled the room <end>` | `<start> పొగ గదిని నింపింది <end>` |
| 5 | `<start> tom and mary understood each other <end>` | `<start> టామ్ మరియు మేరీ ఒకరినొకరు అర్థం చేసుకు...` |
| 6 | `<start> many men want to be thin too <end>` | `<start> చాలా మంది పురుషులు కూడా సన్నగా ఉండాలని...` |
| 7 | `<start> we need three cups <end>` | `<start> మాకు మూడు కప్పులు అవసరం <end>` |
| 8 | `<start> i warned tom not to come here <end>` | `<start> టామ్‌ను ఇక్కడికి రానివ్వమని హెచ్చరించా...` |
| 9 | `<start> you two may leave <end>` | `<start> మీరిద్దరూ వెళ్ళవచ్చు <end>` |


### Tokenization

In [15]:
# Tokenizing the text using the tensorflow module.
def tokenize(lang):

  lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
  lang_tokenizer.fit_on_texts(lang)

  tensor = lang_tokenizer.texts_to_sequences(lang)

  tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,padding='post',maxlen=20,dtype='int32')

  return tensor, lang_tokenizer

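- `tf.keras.preprocessing.sequence.pad_sequences()` transforms a list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array. Here every sequence is brought to a length of 20 tokens: shorter ones are padded with zeros at the end (`padding='post'`), and longer ones are truncated.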
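
The padding behaviour can be illustrated with a small pure-Python sketch. The helper `pad_post` is hypothetical and only mimics the call above for `maxlen > 0`; it relies on the Keras default `truncating='pre'`, which drops tokens from the beginning of sequences that are too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(padding='post', truncating='pre') for lists of ints."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen tokens (truncating='pre')
        padded.append(seq + [value] * (maxlen - len(seq)))  # zeros at the end
    return padded

print(pad_post([[1, 5, 2], [1, 7, 8, 9, 3, 2]], maxlen=4))
# [[1, 5, 2, 0], [8, 9, 3, 2]]
```

In the second row the leading token is lost. With `maxlen=20`, a sentence longer than 20 tokens would therefore lose its `<start>` tag unless `truncating='post'` is passed.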


In [16]:
# Applying the tokenize() function for both the columns.
def load_dataset():

  input_tensor, inp_lang_tokenizer = tokenize(new_df['eng'].values)
  target_tensor, targ_lang_tokenizer = tokenize(new_df['tel'].values)

  return input_tensor, target_tensor, inp_lang_tokenizer, targ_lang_tokenizer

In [17]:
# Loading the dataset and assigning the source and target tensors and tokenizers.
input_tensor, target_tensor, inp_lang, targ_lang = load_dataset()

In [18]:
# Finding the shape of source and target variables.
max_length_targ, max_length_inp = target_tensor.shape[1], input_tensor.shape[1]

In [19]:
# Splitting the data with train_test_split() and printing the sizes of the resulting sets.
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.15)

print(len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val))

137201 137201 24212 24212
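
The sizes printed above follow from the 161413 rows of the concatenated dataframe and `test_size=0.15`. The arithmetic below is a small check, assuming that scikit-learn rounds the validation share up and assigns the rest to training.

```python
import math

n_samples = 161413                     # rows in new_df
n_val = math.ceil(0.15 * n_samples)    # validation share, rounded up
n_train = n_samples - n_val            # everything else is used for training

print(n_train, n_val)  # 137201 24212
```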


In [20]:
# Initializing the buffer size, batch size, embedding dimension, number of GRU units and steps per epoch.
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 16
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 128
units = 1024
steps_per_epoch = len(input_tensor_train)//BATCH_SIZE

vocab_inp_size =len(inp_lang.word_index.keys())
vocab_tar_size =len(targ_lang.word_index.keys())

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

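- `tf.data.Dataset.from_tensor_slices()` builds a dataset whose elements are slices of the given arrays along the first axis, so each element here is one (input, target) sentence pair. The pairs are then shuffled and grouped into batches of `BATCH_SIZE`.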
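
The shuffle-and-batch pipeline can be sketched in plain Python as follows. `make_batches` is a hypothetical stand-in for the `tf.data` calls, intended only to show what `drop_remainder=True` does.

```python
import random

def make_batches(inputs, targets, batch_size, seed=0):
    """Shuffle (input, target) pairs and group them into full batches only."""
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)
    n_full = len(pairs) // batch_size  # drop_remainder=True discards the rest
    return [pairs[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]

batches = make_batches(range(10), range(100, 110), batch_size=4)
print(len(batches), [len(b) for b in batches])  # 2 [4, 4]

# With the sizes used in this notebook:
steps = 137201 // 16
print(steps, 137201 - steps * 16)  # 8575 1
```

With 137201 training pairs and a batch size of 16, each epoch therefore runs 8575 steps and leaves one pair unused.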

In [21]:
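embeddings_index = dict()
# Reading the pre-trained GloVe vectors: each line holds a word followed by 300 floats.
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs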

embedding_matrix = np.zeros((vocab_inp_size+1, 300))
for word, i in inp_lang.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
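
The lookup above can be illustrated with a toy, in-memory stand-in for the GloVe file. The vectors and word indices below are made up, and `build_embedding_matrix` is a hypothetical helper that also reports how much of the vocabulary was found.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 is padding
    found = 0
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[i] = list(vec)
            found += 1
    return matrix, found / len(word_index)

toy_vectors = {"tom": [0.1, 0.2, 0.3], "sea": [0.4, 0.5, 0.6]}  # made-up values
toy_index = {"<start>": 1, "<end>": 2, "tom": 3, "sea": 4}
matrix, coverage = build_embedding_matrix(toy_index, toy_vectors, dim=3)
print(coverage)   # 0.5
print(matrix[3])  # [0.1, 0.2, 0.3]
```

Tokens without a pre-trained vector, such as the `<start>` and `<end>` tags in this toy example, keep an all-zero row. Checking the coverage shows how many input words are affected.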

### Modeling

#### The encoder/decoder model

- The following diagram shows an overview of the model. At each time step, the decoder's output is combined with a weighted sum over the encoded input to predict the next word. The diagram and formulas are from Luong's paper; the `Decoder` implemented in this notebook scores the source positions with the additive (Bahdanau-style) form.

![English_Telugu](https://github.com/pragathi1234/Machine_Translation/blob/main/images/eng_tel.jpg)
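
For reference, the attention computation implemented by the `Decoder` class in this notebook can be written as follows. The notation is an assumption made for this summary: $\bar{h}_s$ is the encoder output at source position $s$ (`enc_output`), $h_{t-1}$ is the decoder hidden state passed into the step (`hidden`), and $W_1$, $W_2$, $v$ correspond to the layers `W1`, `W2` and `V`, with their bias terms omitted.

$$
\begin{aligned}
e_{t,s} &= v^\top \tanh\left(W_1 \bar{h}_s + W_2 h_{t-1}\right) \\
\alpha_{t,s} &= \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})} \\
c_t &= \sum_{s} \alpha_{t,s}\, \bar{h}_s
\end{aligned}
$$

Here $e_{t,s}$ is the score, $\alpha_{t,s}$ the attention weight, and $c_t$ the context vector that is concatenated with the embedded input token before the GRU step.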

#### Encoder 
- Takes a list of token IDs (produced by `tokenize()`).
- Looks up an embedding vector for each token (using a `layers.Embedding`).
- Processes the embeddings into a new sequence (using a `layers.GRU`).
- Returns:     
    - The processed sequence. This will be passed to the attention head.     
    - The internal state. This will be used to initialize the decoder.

In [22]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim, name="embedding_layer_encoder",trainable=False)
        self.gru = tf.keras.layers.GRU(self.enc_units, return_sequences=True, return_state=True, recurrent_activation='sigmoid', recurrent_initializer='glorot_uniform')
    
    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state = hidden)
        return output, state
    
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.enc_units))

#### Decoder with attention
The decoder's job is to generate predictions for the next output token.
- The decoder receives the complete encoder output.
- It uses an RNN to keep track of what it has generated so far.
- It uses its previous hidden state as the query to the attention over the encoder's output, producing the context vector.
- It concatenates the context vector with the embedding of the previous target token and passes the result through the GRU.
- It generates logit predictions for the next token from the GRU output.

In [23]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units, return_sequences=True, return_state=True, recurrent_activation='sigmoid', recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)

        # used for attention
        self.W1 = tf.keras.layers.Dense(self.dec_units)
        self.W2 = tf.keras.layers.Dense(self.dec_units)
        self.V = tf.keras.layers.Dense(1)

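    def call(self, x, hidden, enc_output):
        # hidden: (batch, units) -> (batch, 1, units) so that it broadcasts over source positions
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # additive score for every source position: (batch, max_length_inp, 1)
        score = self.V(tf.nn.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis)))

        # normalize the scores over the source positions (axis 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context vector: attention-weighted sum of the encoder outputs, (batch, units)
        context_vector = attention_weights * enc_output
        context_vector = tf.reduce_sum(context_vector, axis=1)

        # embed the previous target token: (batch, 1, embedding_dim)
        x = self.embedding(x)

        # concatenate context and embedding: (batch, 1, units + embedding_dim)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # run one GRU step on the concatenated vector
        output, state = self.gru(x)

        # (batch, 1, units) -> (batch, units)
        output = tf.reshape(output, (-1, output.shape[2]))

        # logits over the target vocabulary: (batch, vocab_size)
        x = self.fc(output)

        return x, state, attention_weights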
        
    def initialize_hidden_state(self):
        return tf.zeros((self.batch_sz, self.dec_units))
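
To make the attention step in `Decoder.call` concrete, the following is a minimal pure-Python sketch for a single example and a single decoding step. It is not the notebook's implementation: the function name, the toy weights and the inputs are made up, and the bias terms of the `Dense` layers are omitted.

```python
import math

def additive_attention(enc_outputs, query, W1, W2, v):
    """Return (weights, context) for one decoding step.

    enc_outputs: list of encoder vectors, one per source position
    query:       decoder hidden state
    W1, W2:      weight matrices given as lists of rows; v: weight vector
    """
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]

    q_proj = matvec(W2, query)
    scores = []
    for h in enc_outputs:
        h_proj = matvec(W1, h)
        hidden = [math.tanh(a + b) for a, b in zip(h_proj, q_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # softmax over the source positions (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # context vector: attention-weighted sum of the encoder outputs
    dim = len(enc_outputs[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_outputs)) for d in range(dim)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention(enc, query=[0.5, -0.5], W1=identity, W2=identity, v=[1.0, 1.0])
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The weights sum to one, and the source position with the highest score (the third one in this toy input) contributes most to the context vector.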

In [24]:
tf.keras.backend.clear_session()

encoder = Encoder(vocab_inp_size+1, 300, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size+1, embedding_dim, units, BATCH_SIZE)

In [25]:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True,
                                                            reduction='none')

def loss_function(real, pred):
  mask = tf.math.logical_not(tf.math.equal(real, 0))
  loss_ = loss_object(real, pred)

  mask = tf.cast(mask, dtype=loss_.dtype)
  loss_ *= mask

  return tf.reduce_mean(loss_)
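
The effect of the padding mask can be seen in a small pure-Python sketch for one time step. The function `masked_loss` and the numbers are illustrative; like `tf.reduce_mean` above, it divides by the batch size, so padded positions lower the average rather than being excluded from it.

```python
import math

def masked_loss(real_ids, logits_per_example, pad_id=0):
    """Mean cross-entropy over the batch, with padded positions contributing zero."""
    losses = []
    for real, logits in zip(real_ids, logits_per_example):
        if real == pad_id:
            losses.append(0.0)  # masked out
            continue
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum_exp - logits[real])  # -log softmax(logits)[real]
    return sum(losses) / len(losses)

real = [2, 0]  # the second example is padding at this time step
logits = [[0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
print(round(masked_loss(real, logits), 4))  # 0.5493
```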

- The `train_step()` function below runs the encoder, feeds the decoder one target token at a time (teacher forcing), accumulates the masked loss over the time steps, and applies the gradients.

In [26]:
@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0

  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)
    encoder.get_layer('embedding_layer_encoder').set_weights([embedding_matrix])
    dec_hidden = enc_hidden

    dec_input = tf.expand_dims([targ_lang.word_index['<start>']] * BATCH_SIZE, 1)

    for t in range(1, targ.shape[1]):
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)

      loss += loss_function(targ[:, t], predictions)

      dec_input = tf.expand_dims(targ[:, t], 1)

  batch_loss = (loss / int(targ.shape[1]))

  variables = encoder.trainable_variables + decoder.trainable_variables

  gradients = tape.gradient(loss, variables)

  optimizer.apply_gradients(zip(gradients, variables))

  return batch_loss
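
The loop in `train_step()` uses teacher forcing: at step `t` the decoder is fed the ground-truth token from step `t-1` and is scored against the ground-truth token at step `t`. The sketch below shows this alignment for one padded target sequence; the helper name and the token IDs are hypothetical.

```python
def teacher_forcing_pairs(target_ids):
    """Pair each decoder input with the token it should predict next."""
    return [(target_ids[t - 1], target_ids[t]) for t in range(1, len(target_ids))]

# Hypothetical IDs: 1 = <start>, 2 = <end>, 0 = padding
print(teacher_forcing_pairs([1, 7, 9, 2, 0]))
# [(1, 7), (7, 9), (9, 2), (2, 0)]
```

The last pair has the padding ID as its label, which is why `loss_function` masks out positions where the target is 0.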

In [27]:
EPOCHS = 20

for epoch in range(EPOCHS):
  start = time.time()

  enc_hidden = encoder.initialize_hidden_state()
  total_loss = 0

  for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
    batch_loss = train_step(inp, targ, enc_hidden)
    total_loss += batch_loss

    if batch % 1000 == 0:
      print(f'Epoch {epoch+1} Batch {batch} Loss {batch_loss.numpy():.4f}')

  print(f'Epoch {epoch+1} Loss {total_loss/steps_per_epoch:.4f}')
  print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

Epoch 1 Batch 0 Loss 3.4770
Epoch 1 Batch 1000 Loss 1.6304
Epoch 1 Batch 2000 Loss 1.3880
Epoch 1 Batch 3000 Loss 1.3756
Epoch 1 Batch 4000 Loss 1.0877
Epoch 1 Batch 5000 Loss 0.6772
Epoch 1 Batch 6000 Loss 1.2839
Epoch 1 Batch 7000 Loss 1.2911
Epoch 1 Batch 8000 Loss 0.4900
Epoch 1 Loss 1.1557
Time taken for 1 epoch 1112.98 sec

Epoch 2 Batch 0 Loss 0.5925
Epoch 2 Batch 1000 Loss 0.5091
Epoch 2 Batch 2000 Loss 0.6630
Epoch 2 Batch 3000 Loss 0.4059
Epoch 2 Batch 4000 Loss 0.6943
Epoch 2 Batch 5000 Loss 0.6147
Epoch 2 Batch 6000 Loss 0.5392
Epoch 2 Batch 7000 Loss 0.4819
Epoch 2 Batch 8000 Loss 0.9233
Epoch 2 Loss 0.6053
Time taken for 1 epoch 1086.81 sec

Epoch 3 Batch 0 Loss 0.8343
Epoch 3 Batch 1000 Loss 0.2279
Epoch 3 Batch 2000 Loss 0.2498
Epoch 3 Batch 3000 Loss 0.3756
Epoch 3 Batch 4000 Loss 0.6343
Epoch 3 Batch 5000 Loss 0.5093
Epoch 3 Batch 6000 Loss 0.2422
Epoch 3 Batch 7000 Loss 0.3614
Epoch 3 Batch 8000 Loss 0.2751
Epoch 3 Loss 0.3850
Time taken for 1 epoch 1086.59 sec

Epoc

###  Model Evaluation

In [28]:
def evaluate(sentence):
  attention_plot = np.zeros((max_length_targ, max_length_inp))

  sentence = preprocess_eng(sentence)

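  # Words that were not seen during training are skipped to avoid a KeyError.
  inputs = [inp_lang.word_index[w] for w in sentence.split(' ') if w in inp_lang.word_index]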
  inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs],maxlen=20, padding='post')
  inputs = tf.convert_to_tensor(inputs)

  result = ''

  hidden = [tf.zeros((1, units))]
  enc_out, enc_hidden = encoder(inputs, hidden)

  dec_hidden = enc_hidden
  dec_input = tf.expand_dims([targ_lang.word_index['<start>']], 0)

  for t in range(max_length_targ):
    predictions, dec_hidden, attention_weights = decoder(dec_input,
                                                         dec_hidden,
                                                         enc_out)
    # storing the attention weights to plot later on
    attention_weights = tf.reshape(attention_weights, (-1, ))
    attention_plot[t] = attention_weights.numpy()
    predicted_id = tf.argmax(predictions[0]).numpy()

    result += targ_lang.index_word[predicted_id] + ' '

    if targ_lang.index_word[predicted_id] == '<end>':
      return result,attention_plot

    # the predicted ID is fed back into the model
    dec_input = tf.expand_dims([predicted_id], 0)

  return result,attention_plot
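
`evaluate()` performs greedy decoding: at every step it takes the highest-scoring token and feeds it back into the decoder until `<end>` is produced or the maximum length is reached. A minimal sketch of that loop is shown below. `toy_step` is a made-up stand-in for the trained decoder, used only to make the snippet runnable.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Pick the highest-scoring token at every step and feed it back in."""
    token, state, output = start_id, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        output.append(token)
        if token == end_id:
            break
    return output

def toy_step(token, state):
    """Toy decoder step: token i always prefers token i + 1."""
    vocab_size = 5
    scores = [0.0] * vocab_size
    scores[min(token + 1, vocab_size - 1)] = 1.0
    return scores, state

print(greedy_decode(toy_step, start_id=0, end_id=3, max_len=10))  # [1, 2, 3]
```

As in `evaluate()`, the end token is part of the returned sequence, so it may need to be stripped before the output is compared with a reference translation.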

In [51]:
input_sentence= 'please ensure that you use the appropriate form '
print('Input sentence in english : ',input_sentence)
predicted_output_1,attention_plot=evaluate(input_sentence)
print('Predicted sentence in telugu : ',predicted_output_1)

Input sentence in english :  please ensure that you use the appropriate form 
Predicted sentence in telugu :  మీరు అవసరమైన విధంగా దరఖాస్తు చేసుకోవాలి <end> 


In [52]:
input_sentence="Hello my friends! How are you doing today?"
print('Input sentence in english : ',input_sentence)
predicted_output_2,attention_plot=evaluate(input_sentence)
print('Predicted sentence in telugu : ',predicted_output_2)

Input sentence in english :  Hello my friends! How are you doing today?
Predicted sentence in telugu :  ఈ రోజు మీ ఆలస్యంగా ఎలా ధన్యవాదాలు <end> 


In [31]:
!pip install googletrans==3.1.0a0


In [53]:
from googletrans import Translator
translator=Translator()
out=translator.translate("Hello my friends! How are you doing today?",dest="te")
print(out.text)

హలో నా స్నేహితులారా! ఈరోజు మీరు ఎలా ఉన్నారు?


In [54]:
from nltk.translate.bleu_score import sentence_bleu 
from nltk.translate.bleu_score import SmoothingFunction 

In [55]:
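# NOTE: sentence_bleu expects a list of tokenized references and a tokenized
# hypothesis. Raw strings are passed here, so NLTK iterates over characters and
# the value below reflects character overlap rather than word-level BLEU.
# NOTE: out.text is the Google translation of the second test sentence, whereas
# predicted_output_1 is the model's translation of the first test sentence.
reference = out.text
candidate = predicted_output_1
score = sentence_bleu(reference, candidate)
print("BLEU score for encoder-decoder model with attention is: ",score)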

BLEU score for encoder-decoder model with attention is:  0.7427498127683173


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
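
BLEU is defined over tokens, and `sentence_bleu` takes a list of tokenized references plus a tokenized hypothesis, for example `sentence_bleu([reference.split()], candidate.split())`. The sketch below uses clipped unigram precision, the 1-gram component of BLEU, to show how much the result changes when raw strings are iterated character by character instead. The helper and the English sample sentences are illustrative, and whitespace tokenization is an assumption.

```python
from collections import Counter

def clipped_unigram_precision(reference_tokens, candidate_tokens):
    """Fraction of candidate tokens that also occur in the reference (counts clipped)."""
    ref_counts = Counter(reference_tokens)
    cand_counts = Counter(candidate_tokens)
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / max(len(candidate_tokens), 1)

reference = "the cat sat on the mat"
candidate = "a cat ate the hat"

# Word level: split both sentences into tokens first.
word_level = clipped_unigram_precision(reference.split(), candidate.split())
print(word_level)  # 0.4

# Character level: iterating a raw string yields characters, which inflates the overlap.
char_level = clipped_unigram_precision(list(reference), list(candidate))
print(char_level > word_level)  # True
```

For very short sentences, NLTK's `SmoothingFunction` can additionally be passed via the `smoothing_function` argument, as the warning above suggests.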


In [56]:
reference_1 = out.text
candidate_1 = predicted_output_2 
score_1 = sentence_bleu(reference_1, candidate_1)
print("BLEU score for encoder-decoder model with attention mechanism is: ",score_1)

BLEU score for encoder-decoder model with attention mechanism is:  0.7691605673134586


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Machine Translation using Transformers

In [36]:
!pip install transformers


In [37]:
!pip install sentencepiece



In [38]:
from transformers import pipeline, MarianTokenizer, MarianMTModel
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

### facebook/mbart-large-50-one-to-many-mmt 


- This [model](https://huggingface.co/facebook/mbart-large-50-one-to-many-mmt) is a fine-tuned version of the mBART-large-50 checkpoint.
- It has been fine-tuned for machine translation into multiple languages.
- It was introduced in the paper "Multilingual Translation with Extensible Multilingual Pretraining and Finetuning".
- The model can translate English into 49 other languages. The target language ID is forced as the first generated token when translating into a target language.

In [67]:
model_name="facebook/mbart-large-50-one-to-many-mmt"

tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Named input_text to avoid shadowing the built-in input().
input_text = "Hello my friends! How are you doing today?"
model_inputs = tokenizer(input_text, return_tensors="pt")

generated_tokens = model.generate(**model_inputs,forced_bos_token_id=tokenizer.lang_code_to_id["te_IN"])
res=tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

In [72]:
print("The translated text is: {}".format(res[0]))

The translated text is: హలో నా స్నేహితులు!


In [68]:
reference_2 = out.text
candidate_2 = res[0]
score_2 = sentence_bleu(reference_2, candidate_2)
print("BLEU score for 'facebook/mbart-large-50-one-to-many-mmt' model is: ",score_2)

BLEU score for 'facebook/mbart-large-50-one-to-many-mmt' model is:  0.9218658175671758


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Helsinki-NLP/opus-mt-en-dra


- The [model](https://huggingface.co/Helsinki-NLP/opus-mt-en-dra) can translate English to Dravidian languages.
- Pre-processing: normalization + SentencePiece.
- A sentence-initial language token is required in the form of `>>id<<` (id = valid target language ID).
- BLEU score - 7.1

In [69]:
model_name = 'Helsinki-NLP/opus-mt-en-dra' 

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

translation_engine = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

text_to_translate = "Hello my friends! How are you doing today?"

translated_text = translation_engine(">>tel<<" +text_to_translate)

In [70]:
print("The translated text is: {}".format(translated_text[0]["generated_text"]))

The translated text is: హలో నా స్నేహితులు!


In [71]:
reference_3 = out.text
candidate_3 = translated_text[0]["generated_text"]
score_3 = sentence_bleu(reference_3, candidate_3)
print("BLEU score for 'Helsinki-NLP/opus-mt-en-dra' model is: ",score_3)

BLEU score for 'Helsinki-NLP/opus-mt-en-dra' model is:  0.9218658175671758


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Helsinki-NLP/opus-mt-en-mul


- The [model](https://huggingface.co/Helsinki-NLP/opus-mt-en-mul) can translate from English to a variety of other languages.
- Pre-processing: Normalization + SentencePiece
- A sentence initial language token is required in the form of >>id<<
- BLEU score - 4.7

In [73]:
model_name = 'Helsinki-NLP/opus-mt-en-mul' 

tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

translation_engine = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

text_to_translate = "Hello my friends! How are you doing today?" 

translated_text = translation_engine(">>tel<<" +text_to_translate)

In [74]:
print("The translated text is: {}".format(translated_text[0]["generated_text"]))

The translated text is: హలో నా స్నేహితులు, మీరు నేడు ఎలా చేస్తున్నారు?


In [75]:
reference_4 = out.text
candidate_4 = translated_text[0]["generated_text"]
score_4 = sentence_bleu(reference_4, candidate_4)
print("BLEU score for 'Helsinki-NLP/opus-mt-en-mul model'  is: ",score_4)

BLEU score for 'Helsinki-NLP/opus-mt-en-mul model'  is:  0.7796914510717229


Corpus/Sentence contains 0 counts of 2-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


### Results:

In [76]:
models=["EncoderDecoder with Attention", "facebook/mbart-large-50-one-to-many-mmt",
        "Helsinki-NLP/opus-mt-en-dra", "Helsinki-NLP/opus-mt-en-mul"]

scores=[score_1,score_2,score_3,score_4]

results={'Models':models,'BLEU scores':scores}
res=pd.DataFrame(results)

In [77]:
res.head()

|   | Models | BLEU scores |
|---|--------|-------------|
| 0 | EncoderDecoder with Attention | 0.769161 |
| 1 | facebook/mbart-large-50-one-to-many-mmt | 0.921866 |
| 2 | Helsinki-NLP/opus-mt-en-dra | 0.921866 |
| 3 | Helsinki-NLP/opus-mt-en-mul | 0.779691 |


### Conclusion:
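In this project, we compared two approaches to neural machine translation from English to Telugu: an encoder-decoder model with a global attention mechanism that looks at all source positions at every decoding step, and pretrained Hugging Face models. On the test sentence, Helsinki-NLP/opus-mt-en-mul was the only pretrained model that translated the complete sentence. Helsinki-NLP/opus-mt-en-dra and facebook/mbart-large-50-one-to-many-mmt stopped after the greeting, although they received the highest BLEU scores in the table above (0.92 versus 0.78). Because the scores were computed on a single sentence and from raw strings, they should be read as a rough indication rather than as a word-level BLEU comparison.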


### References:

- https://www.tensorflow.org/text/tutorials/nmt_with_attention
- https://arxiv.org/pdf/1508.04025.pdf
- https://towardsdatascience.com/neural-machine-translation-nmt-with-attention-mechanism-5e59b57bd2ac
- https://emerj.com/ai-sector-overviews/machine-translation-14-current-applications-and-services/
