# 🪔 Project 3: English to Hindi (Text Translation)

🧾**Description:** The IIT Bombay English-Hindi corpus contains parallel corpus for English-Hindi as well as monolingual Hindi corpus collected from a variety of existing sources and corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. This page describes the corpus. This corpus has been used at the Workshop on Asian Language Translation Shared Task since 2016 the Hindi-to-English and English-to-Hindi languages pairs and as a pivot language pair for the Hindi-to-Japanese and Japanese-to-Hindi language pairs.

source of the dataset - https://www.cfilt.iitb.ac.in/iitb_parallel/

research paper - The IIT Bombay English-Hindi Parallel Corpus  - https://arxiv.org/pdf/1710.02855.pdf

🧭 **Problem Statement:** You are provided with a large dataset of language pairs, parallelly in English and Hindi: you have to perform a step-by-step NLP approach to translate English to Hindi after splitting the dataset into train-test-validation sets.  


In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import string
from string import digits
import matplotlib.pyplot as plt
%matplotlib inline
import re

import seaborn as sns
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.layers import Input, LSTM, Embedding, Dense
from keras.models import Model

#print(os.listdir("../input"))

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', -1)



  pd.set_option('display.max_colwidth', -1)


In [2]:
df=pd.read_csv("hindi_english_parallel.csv",encoding='utf-8')

In [3]:
df.head()

Unnamed: 0,hindi,english
0,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें,Give your application an accessibility workout
1,एक्सेर्साइसर पहुंचनीयता अन्वेषक,Accerciser Accessibility Explorer
2,निचले पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the bottom panel
3,ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the top panel
4,उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से निष्क्रिय किया गया है,A list of plugins that are disabled by default


In [4]:
df.shape

(1561841, 2)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1561841 entries, 0 to 1561840
Data columns (total 2 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   hindi    1555785 non-null  object
 1   english  1561116 non-null  object
dtypes: object(2)
memory usage: 23.8+ MB


In [6]:
# Checking for NULL values
round((df.isnull().sum()/len(df['hindi']))*100, 2)

hindi      0.39
english    0.05
dtype: float64

In [7]:
# Since 40% of data in hindi have no translation, it is of no use to have it and we have to drop them
df = df.dropna()

In [8]:
df.isnull().sum()

hindi      0
english    0
dtype: int64

In [9]:
df.shape

(1555727, 2)

In [10]:
df.drop_duplicates(inplace=True)

In [11]:
df.shape

(1353912, 2)

df['hindi'].value_counts()

df=df[df['source']=='ted']

In [12]:
df.head(20)

Unnamed: 0,hindi,english
0,अपने अनुप्रयोग को पहुंचनीयता व्यायाम का लाभ दें,Give your application an accessibility workout
1,एक्सेर्साइसर पहुंचनीयता अन्वेषक,Accerciser Accessibility Explorer
2,निचले पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the bottom panel
3,ऊपरी पटल के लिए डिफोल्ट प्लग-इन खाका,The default plugin layout for the top panel
4,उन प्लग-इनों की सूची जिन्हें डिफोल्ट रूप से निष्क्रिय किया गया है,A list of plugins that are disabled by default
5,अवधि को हाइलाइट रकें,Highlight duration
6,पहुंचनीय आसंधि (नोड) को चुनते समय हाइलाइट बक्से की अवधि,The duration of the highlight box when selecting accessible nodes
7,सीमांत (बोर्डर) के रंग को हाइलाइट करें,Highlight border color
8,हाइलाइट किए गए सीमांत का रंग और अपारदर्शिता।,The color and opacity of the highlight border.
9,भराई के रंग को हाइलाइट करें,Highlight fill color


In [13]:
pd.isnull(df).sum()

hindi      0
english    0
dtype: int64

In [14]:
df=df[~pd.isnull(df['english'])]

In [15]:
df.drop_duplicates(inplace=True)

* ### Let us pick any 25000 rows from the dataset.

In [16]:
df=df.sample(n=25000,random_state=42)
df.shape

(25000, 2)

In [17]:
# Lowercase all characters
df['english']=df['english'].apply(lambda x: x.lower())
df['hindi']=df['hindi'].apply(lambda x: x.lower())

In [18]:
# Remove quotes
df['english']=df['english'].apply(lambda x: re.sub("'", '', x))
df['hindi']=df['hindi'].apply(lambda x: re.sub("'", '', x))

In [19]:
exclude = set(string.punctuation) # Set of all special characters
# Remove all the special characters
df['english']=df['english'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))
df['hindi']=df['hindi'].apply(lambda x: ''.join(ch for ch in x if ch not in exclude))

In [20]:
# Remove all numbers from text
remove_digits = str.maketrans('', '', digits)
df['english']=df['english'].apply(lambda x: x.translate(remove_digits))
df['hindi']=df['hindi'].apply(lambda x: x.translate(remove_digits))

df['hindi'] = df['hindi'].apply(lambda x: re.sub("[२३०८१५७९४६]", "", x))

# Remove extra spaces
df['english']=df['english'].apply(lambda x: x.strip())
df['hindi']=df['hindi'].apply(lambda x: x.strip())
df['english']=df['english'].apply(lambda x: re.sub(" +", " ", x))
df['hindi']=df['hindi'].apply(lambda x: re.sub(" +", " ", x))


In [21]:
# Add start and end tokens to target sequences
df['hindi'] = df['hindi'].apply(lambda x : 'START_ '+ x + ' _END')

In [22]:
df.head()

Unnamed: 0,hindi,english
1255665,START_ महाराजा कॉलिज जयपुर _END,maharaja college jaipur
1238476,START_ इन गॉड वी ट्रस्ट _END,in god we trust
1411628,START_ वह नौकरी ढूँढ़ने में अपनी असफलता की चर्चा करता है अंत में वह न पास न फ़ीस के बाद पर चलायी जा रही एक कक्षा में विद्यार्थियों को पाठ रटा रहा होता है। _END,he recounts his unsuccessful attempts to get a job he is finally left running a crammer s class with the promise no pass no fees
1547009,START_ राजशाही साम्राज्य बहादुर शाह द्वितीय के बाद समाप्त हो गया जो सिपाहियों की बगावत में सहायता देने के संदेह पर ब्रिटिश राज द्वारा रंगून निर्वासित कर दिए गए थे। वहां में उनकी मृत्यु हो गई। _END,the imperial dynasty became extinct with bahadur shah ii who was deported to rangoon by the british on suspicion of assisting the sepoy mutineers he died there in
1344715,START_ संयुक्त राज्य अमेरिका के पूर्वोत्तर क्षेत्र में आए आंधीतूफान में सबसे कम पी एच नवम्बर में मापा गया था। _END,the lowest ph value recorded for a storm in northeastern united states was during november


In [23]:
### Get English and Hindi Vocabulary
all_eng_words=set()
for eng in df['english']:
    for word in eng.split():
        if word not in all_eng_words:
            all_eng_words.add(word)

all_hindi_words=set()
for hin in df['hindi']:
    for word in hin.split():
        if word not in all_hindi_words:
            all_hindi_words.add(word)

In [24]:
len(all_eng_words)

28326

In [25]:
len(all_hindi_words)

38586

In [26]:
df['length_eng_sentence']=df['english'].apply(lambda x:len(x.split(" ")))
df['length_hin_sentence']=df['hindi'].apply(lambda x:len(x.split(" ")))

In [27]:
df.head()

Unnamed: 0,hindi,english,length_eng_sentence,length_hin_sentence
1255665,START_ महाराजा कॉलिज जयपुर _END,maharaja college jaipur,3,5
1238476,START_ इन गॉड वी ट्रस्ट _END,in god we trust,4,6
1411628,START_ वह नौकरी ढूँढ़ने में अपनी असफलता की चर्चा करता है अंत में वह न पास न फ़ीस के बाद पर चलायी जा रही एक कक्षा में विद्यार्थियों को पाठ रटा रहा होता है। _END,he recounts his unsuccessful attempts to get a job he is finally left running a crammer s class with the promise no pass no fees,25,35
1547009,START_ राजशाही साम्राज्य बहादुर शाह द्वितीय के बाद समाप्त हो गया जो सिपाहियों की बगावत में सहायता देने के संदेह पर ब्रिटिश राज द्वारा रंगून निर्वासित कर दिए गए थे। वहां में उनकी मृत्यु हो गई। _END,the imperial dynasty became extinct with bahadur shah ii who was deported to rangoon by the british on suspicion of assisting the sepoy mutineers he died there in,28,37
1344715,START_ संयुक्त राज्य अमेरिका के पूर्वोत्तर क्षेत्र में आए आंधीतूफान में सबसे कम पी एच नवम्बर में मापा गया था। _END,the lowest ph value recorded for a storm in northeastern united states was during november,15,21


In [28]:
df[df['length_eng_sentence']>30].shape

(2528, 4)

In [29]:
df=df[df['length_eng_sentence']<=20]
df=df[df['length_hin_sentence']<=20]

In [30]:
df.shape

(17559, 4)

In [31]:
print("maximum length of Hindi Sentence ",max(df['length_hin_sentence']))
print("maximum length of English Sentence ",max(df['length_eng_sentence']))

maximum length of Hindi Sentence  20
maximum length of English Sentence  20


In [32]:
max_length_src=max(df['length_hin_sentence'])
max_length_tar=max(df['length_eng_sentence'])

In [33]:
input_words = sorted(list(all_eng_words))
target_words = sorted(list(all_hindi_words))
num_encoder_tokens = len(all_eng_words)
num_decoder_tokens = len(all_hindi_words)
num_encoder_tokens, num_decoder_tokens

(28326, 38586)

In [34]:
num_decoder_tokens += 1 #for zero padding


In [35]:
input_token_index = dict([(word, i+1) for i, word in enumerate(input_words)])
target_token_index = dict([(word, i+1) for i, word in enumerate(target_words)])

In [36]:
reverse_input_char_index = dict((i, word) for word, i in input_token_index.items())
reverse_target_char_index = dict((i, word) for word, i in target_token_index.items())

In [37]:
df = shuffle(df)
df.head(10)

Unnamed: 0,hindi,english,length_eng_sentence,length_hin_sentence
759649,START_ कशीदा _END,fancywork,1,3
571088,START_ वनस्पति _END,forest,1,3
528210,START_ एदेक _END,movement for social democracy,4,3
611869,START_ ऐसा कहा जाता है कि वे क़ुरान और भगवद् गीता दोनों का अध्यन करते हैं। _END,they say that he studies koran as well as bhagwad gita,11,17
851193,START_ शत _END,c,1,3
633393,START_ प्रदूषण फैलाने देता सिद्धांत polluter pays principle _END,doctrine on causes of pollution,5,9
969679,START_ डिस्क कैशिंग में लिखने के लिए डिस्क कैशे का इस्तेमाल किया जा रहा है। _END,disk cache is being used for writing in disk caching,10,16
481117,START_ सूची दृश्य के आरंभ करने में एक त्रुटि आई _END,the list view encountered an error while starting up,9,11
1250572,START_ परिचय _END,parichay,1,3
1212178,START_ हमारे सकल घरेलू उत्पादन मेंऔर में क्रमशः औरप्रतिशत वृद्धि हुई है। _END,fiscal expansionary measures helped in maintaining the growth momentum amidst downturn anxieties emanating from the global markets,17,13


### Split the data into train and test

In [38]:
X, y = df['english'], df['hindi']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,random_state=42)
X_train.shape, X_test.shape

((14047,), (3512,))

### Let us save this data

In [39]:
X_train.to_pickle('X_train.pkl')
X_test.to_pickle('X_test.pkl')


In [40]:
def generate_batch(X = X_train, y = y_train, batch_size = 128):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X), batch_size):
            encoder_input_data = np.zeros((batch_size, max_length_src),dtype='float32')
            decoder_input_data = np.zeros((batch_size, max_length_tar),dtype='float32')
            decoder_target_data = np.zeros((batch_size, max_length_tar, num_decoder_tokens),dtype='float32')
            for i, (input_text, target_text) in enumerate(zip(X[j:j+batch_size], y[j:j+batch_size])):
                for t, word in enumerate(input_text.split()):
                    encoder_input_data[i, t] = input_token_index[word] # encoder input seq
                for t, word in enumerate(target_text.split()):
                    if t<len(target_text.split())-1:
                        decoder_input_data[i, t] = target_token_index[word] # decoder input seq
                    if t>0:
                        # decoder target sequence (one hot encoded)
                        # does not include the START_ token
                        # Offset by one timestep
                        decoder_target_data[i, t - 1, target_token_index[word]] = 1.
            yield([encoder_input_data, decoder_input_data], decoder_target_data)

### Encoder-Decoder Architecture

In [41]:
latent_dim=300

In [42]:
# Encoder
encoder_inputs = Input(shape=(None,))
enc_emb =  Embedding(num_encoder_tokens, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]

In [43]:
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,))
dec_emb_layer = Embedding(num_decoder_tokens, latent_dim, mask_zero = True)
dec_emb = dec_emb_layer(decoder_inputs)
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb,
                                     initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

In [44]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

In [45]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_1 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None)]       0           []                               
                                                                                                  
 embedding (Embedding)          (None, None, 300)    8497800     ['input_1[0][0]']                
                                                                                                  
 embedding_1 (Embedding)        (None, None, 300)    11576100    ['input_2[0][0]']                
                                                                                              

In [46]:
train_samples = len(X_train)
val_samples = len(X_test)
batch_size = 128
epochs = 100

In [48]:
model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
                    steps_per_epoch = train_samples//batch_size,
                    epochs=epochs,
                    validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
                    validation_steps = val_samples//batch_size)



  model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),


Epoch 1/100

InvalidArgumentError: Graph execution error:

Detected at node 'model/embedding/embedding_lookup' defined at (most recent call last):
    File "C:\Users\monas\anaconda3\lib\runpy.py", line 197, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "C:\Users\monas\anaconda3\lib\runpy.py", line 87, in _run_code
      exec(code, run_globals)
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
      app.launch_new_instance()
    File "C:\Users\monas\anaconda3\lib\site-packages\traitlets\config\application.py", line 846, in launch_instance
      app.start()
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 677, in start
      self.io_loop.start()
    File "C:\Users\monas\anaconda3\lib\site-packages\tornado\platform\asyncio.py", line 199, in start
      self.asyncio_loop.run_forever()
    File "C:\Users\monas\anaconda3\lib\asyncio\base_events.py", line 601, in run_forever
      self._run_once()
    File "C:\Users\monas\anaconda3\lib\asyncio\base_events.py", line 1905, in _run_once
      handle._run()
    File "C:\Users\monas\anaconda3\lib\asyncio\events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 471, in dispatch_queue
      await self.process_one()
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 460, in process_one
      await dispatch(*args)
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 367, in dispatch_shell
      await result
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 662, in execute_request
      reply_content = await reply_content
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 360, in do_execute
      res = shell.run_cell(code, store_history=store_history, silent=silent)
    File "C:\Users\monas\anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 532, in run_cell
      return super().run_cell(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2863, in run_cell
      result = self._run_cell(
    File "C:\Users\monas\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2909, in _run_cell
      return runner(coro)
    File "C:\Users\monas\anaconda3\lib\site-packages\IPython\core\async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "C:\Users\monas\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3106, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "C:\Users\monas\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3309, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "C:\Users\monas\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3369, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "C:\Users\monas\AppData\Local\Temp\ipykernel_10376\2713035456.py", line 1, in <cell line: 1>
      model.fit_generator(generator = generate_batch(X_train, y_train, batch_size = batch_size),
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 2507, in fit_generator
      return self.fit(
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1564, in fit
      tmp_logs = self.train_function(iterator)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1160, in train_function
      return step_function(self, iterator)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1146, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1135, in run_step
      outputs = model.train_step(data)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 993, in train_step
      y_pred = self(x, training=True)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\functional.py", line 510, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\functional.py", line 667, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\layers\core\embedding.py", line 208, in call
      out = tf.nn.embedding_lookup(self.embeddings, inputs)
Node: 'model/embedding/embedding_lookup'
indices[72,0] = 28326 is not in [0, 28326)
	 [[{{node model/embedding/embedding_lookup}}]] [Op:__inference_train_function_13999]

In [51]:
model.fit(x = X_train,
          y = y_train,
          batch_size = batch_size,
          steps_per_epoch = train_samples//batch_size,
          epochs=epochs,
          validation_data = generate_batch(X_test, y_test, batch_size = batch_size),
          validation_steps = val_samples//batch_size)

Epoch 1/100


ValueError: in user code:

    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1160, in train_function  *
        return step_function(self, iterator)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1146, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 1135, in run_step  **
        outputs = model.train_step(data)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\training.py", line 993, in train_step
        y_pred = self(x, training=True)
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "C:\Users\monas\anaconda3\lib\site-packages\keras\engine\input_spec.py", line 216, in assert_input_compatibility
        raise ValueError(

    ValueError: Layer "model" expects 2 input(s), but it received 1 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, 1) dtype=string>]


In [None]:
model.save_weights('nmt_weights.h5')

In [None]:
# Encode the input sequence to get the "thought vectors"
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder setup
# Below tensors will hold the states of the previous time step
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

dec_emb2= dec_emb_layer(decoder_inputs) # Get the embeddings of the decoder sequence

# To predict the next word in the sequence, set the initial states to the states from the previous time step
decoder_outputs2, state_h2, state_c2 = decoder_lstm(dec_emb2, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2) # A dense softmax layer to generate prob dist. over the target vocabulary

# Final decoder model
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states2)


In [None]:
def decode_sequence(input_seq):
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1,1))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0] = target_token_index['START_']

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = ''
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = reverse_target_char_index[sampled_token_index]
        decoded_sentence += ' '+sampled_char

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_char == '_END' or
           len(decoded_sentence) > 50):
            stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1,1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    return decoded_sentence

In [None]:
train_gen = generate_batch(X_train, y_train, batch_size = 1)
k=-1


In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])

In [None]:
k+=1
(input_seq, actual_output), _ = next(train_gen)
decoded_sentence = decode_sequence(input_seq)
print('Input English sentence:', X_train[k:k+1].values[0])
print('Actual Hindi Translation:', y_train[k:k+1].values[0][6:-4])
print('Predicted Hindi Translation:', decoded_sentence[:-4])