<a href="https://colab.research.google.com/github/krishnarevi/TSAI_END2.0_Session11/blob/main/NLP%20From%20Scratch%20Translation%20With%20A%20Sequence%20To%20Sequence%20Network%20And%20Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP FROM SCRATCH: TRANSLATION WITH A SEQUENCE TO SEQUENCE NETWORK AND ATTENTION

Data preprocessing

In [150]:
%matplotlib inline

from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

!wget https://download.pytorch.org/tutorial/data.zip

!unzip data.zip

--2021-07-24 14:58:43--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 13.227.211.3, 13.227.211.25, 13.227.211.92, ...
Connecting to download.pytorch.org (download.pytorch.org)|13.227.211.3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip.2’


2021-07-24 14:58:43 (81.0 MB/s) - ‘data.zip.2’ saved [2882130/2882130]

Archive:  data.zip
replace data/eng-fra.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [151]:
SOS_token = 0
EOS_token = 1

class Lang:
  def __init__(self,name):
    self.name = name
    self.word2index = {}
    self.word2count = {}
    self.index2word = {0:'SOS',1: 'EOS' }
    self.n_words = 2 # Count SOS and EOS 

  #tokenize given sentence
  def addSentence(self,sentence):
    for word in sentence.split(' '):
      self.addWord(word)

  # create vocabulary
  def addWord(self,word):
    if word not in self.word2index: #if word is not present in word2index dictionary
      self.word2index[word]=self.n_words
      self.word2count[word]=1
      self.index2word[self.n_words]=word
      self.n_words += 1
    else :
      self.word2count[word] += 1

# Turn a Unicode string to plain ASCII, thanks to
# https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs
MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

# apply filterpair ,readLangs create vocabulary from pairs
def prepareData(lang1,lang2,reverse = False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))

Reading lines...
Read 135842 sentence pairs
Trimmed to 10599 sentence pairs
Counting words...
Counted words:
fra 4345
eng 2803
['nous sommes tous des laches .', 'we re all cowards .']


After all of these steps, our data is ready. Let's explore it a bit

In [152]:
pairs[15:20]

[['je suis revenu .', 'i m back .'],
 ['me revoila .', 'i m back .'],
 ['je suis chauve .', 'i m bald .'],
 ['je suis occupe .', 'i m busy .'],
 ['je suis occupee .', 'i m busy .']]

In [153]:
type(pairs)

list

Cool, so our data is actually a list of list, and each innter list element consists of fra to eng translation. 

# The architecture we are building

![image](https://miro.medium.com/max/1838/1*tXchCn0hBSUau3WO0ViD7w.jpeg)

As we can see here, we will have an encoder, an attention mechanism block and decoder. In the final code the attention mechanicm block and decoder will be merged into single block as we need both to work together. 

As we can see here, we need to create a copy of h1, h2, h3 and h4. These are encoder outputs for a sentence with 4 words. 

# Encoder

We will build our encoder with a LSTM, but that's all we know. Let's NOT strait away build a class, but see how to come up with one for the Encoder. We need to answer few questions first:
1. what would be the hidden size of our LSTM
2. What would be the input size
3. What would be the embedding dimesions. 

For simplicity, lets keep 1. and 3. to be 256. 

We can't feed our input directly to LSTM, we need to tensorize it, convert to embeddings first. 

`embedding = nn.Embedding(input_size, hidden_size) `

## What is input_size?

Remember the line below?

`input_lang, output_lang, pairs = prepareData('eng', 'fra', True)`

We want to create an embedding layer that has embedding value for each of the word we have in out input layer.Also we will create an encoder LSTM layer.We need to place each layer we create in GPU

In [154]:
input_size = input_lang.n_words
embedding_size =256
hidden_size = 256
embedding = nn. Embedding(input_size,embedding_size).to(device)
lstm = nn. LSTM(embedding_size, hidden_size ).to(device)

Cool, now we need to feed data to our LSTM. We have input in the form of pairs already. Let's start from there. 

In [155]:
sample = random.choice(pairs)
sample

['je ne vais pas jouer a ce jeu .', 'i m not going to play this game .']

Let's tokenize, convert into indices and append EOS token to input, output sample

In [156]:
input_sentence = sample[0]
output_sentence = sample[1]
input_indices = [input_lang.word2index[x] for x in input_sentence.split(' ')]
output_indices = [output_lang.word2index[x] for x in output_sentence.split(' ')]
input_indices.append(EOS_token)
output_indices.append(EOS_token)
input_indices,output_indices

([6, 297, 7, 246, 2194, 115, 528, 2568, 5, 1],
 [2, 3, 147, 61, 532, 2070, 797, 1519, 4, 1])

Let's convert input and output samples to tensor before feeding to embedding layer

In [157]:
input_tensor =torch.tensor(input_indices,dtype=torch.long ,device=device)
output_tensor = torch.tensor(output_indices,dtype=torch.long,device=device)
input_tensor,output_tensor 

(tensor([   6,  297,    7,  246, 2194,  115,  528, 2568,    5,    1],
        device='cuda:0'),
 tensor([   2,    3,  147,   61,  532, 2070,  797, 1519,    4,    1],
        device='cuda:0'))

In [158]:
device

device(type='cuda')

We are working with 1 sample, but we would be working for a batch. Let's fix that by converting our input_tensor into a fake batch

In [159]:
embedded_input = embedding(input_tensor[0].view(-1,1)) #first word only
embedded_input.shape

torch.Size([1, 1, 256])

Let's build our LSTM now

In [161]:

encoder_hidden,encoder_cell = torch.zeros((1,1,256),device=device), torch.zeros((1,1,256),device=device) # initialize encoder hidden and cell state with zero tensor
encoder_outputs = torch.zeros(MAX_LENGTH, 256, device=device)


for i in range(input_tensor.shape[0]) :
    embedded_input = embedding(input_tensor[i].view(-1, 1))
    output, (encoder_hidden,encoder_cell) = lstm(embedded_input, (encoder_hidden,encoder_cell))
    encoder_outputs[i] += output[0,0]

    print('\033[1m' +"Time step {}  \033[0m".format(i))
    if (i<input_tensor.shape[0]-1):
      print('Actual input word = {}'.format(input_sentence.split(" ")[i]))
    else:
      print('Actual input word = {}'.format("<EOS>"))
    print('Embedding of input word {} = {}'.format(i, output[0,0]))
    print('Encoder output at this time step = {}'.format(output[0,0]))

    print("-----------------------------------------------------")



[1mTime step 0  [0m
Actual input word = je
Embedding of input word 0 = tensor([-0.0077, -0.0126, -0.0797, -0.2254,  0.2022, -0.0396,  0.1483, -0.1217,
        -0.0240,  0.0375, -0.1355,  0.0307, -0.0029, -0.1346,  0.0156,  0.1302,
         0.0217, -0.1207,  0.0344,  0.2652, -0.2008, -0.0738, -0.1199,  0.0748,
         0.0500, -0.0573, -0.0034,  0.0999, -0.0279,  0.1133, -0.0982,  0.0235,
        -0.1620,  0.2825,  0.0005,  0.0184,  0.1625,  0.0254, -0.0923,  0.0996,
         0.0432,  0.0148,  0.0212, -0.1219, -0.1130,  0.0411,  0.0106,  0.0193,
        -0.1754, -0.0707, -0.0491, -0.0064,  0.0845,  0.1418,  0.0909,  0.1922,
         0.2086,  0.0315,  0.0747,  0.0184,  0.0332,  0.0288,  0.0548, -0.1106,
         0.0132, -0.1752, -0.1316, -0.1528, -0.0625, -0.0524,  0.0842,  0.3359,
         0.0046,  0.0319,  0.0549,  0.1653,  0.1270, -0.1793,  0.0486,  0.1240,
        -0.0778, -0.0685, -0.0747, -0.2492,  0.0661,  0.0679,  0.1164,  0.1155,
        -0.1437, -0.0858,  0.0388,  0.1976,  0.

In [162]:
encoder_outputs.shape, encoder_hidden.shape

(torch.Size([10, 256]), torch.Size([1, 1, 256]))

In [163]:
encoder_outputs[0:2]

tensor([[-7.7404e-03, -1.2574e-02, -7.9709e-02, -2.2541e-01,  2.0221e-01,
         -3.9594e-02,  1.4831e-01, -1.2166e-01, -2.3984e-02,  3.7537e-02,
         -1.3545e-01,  3.0687e-02, -2.9208e-03, -1.3461e-01,  1.5627e-02,
          1.3022e-01,  2.1747e-02, -1.2067e-01,  3.4441e-02,  2.6524e-01,
         -2.0081e-01, -7.3790e-02, -1.1992e-01,  7.4761e-02,  4.9959e-02,
         -5.7309e-02, -3.3934e-03,  9.9923e-02, -2.7905e-02,  1.1331e-01,
         -9.8245e-02,  2.3474e-02, -1.6201e-01,  2.8248e-01,  4.5370e-04,
          1.8438e-02,  1.6254e-01,  2.5445e-02, -9.2324e-02,  9.9566e-02,
          4.3235e-02,  1.4791e-02,  2.1224e-02, -1.2188e-01, -1.1304e-01,
          4.1086e-02,  1.0605e-02,  1.9316e-02, -1.7542e-01, -7.0712e-02,
         -4.9099e-02, -6.4439e-03,  8.4479e-02,  1.4176e-01,  9.0914e-02,
          1.9221e-01,  2.0863e-01,  3.1516e-02,  7.4726e-02,  1.8408e-02,
          3.3193e-02,  2.8756e-02,  5.4822e-02, -1.1065e-01,  1.3199e-02,
         -1.7515e-01, -1.3164e-01, -1.


Cool! Next let's build out Decoder where we have attention in-built.

# Decoder with Attention

Here is the plan. 

1. First input to the decoder will be SOS_token, later inputs would be the words it predicted (unless we implement teacher forcing)
2. decoder/LSTM's hidden state will be initialized with the encoder's last hidden state
3. we will use LSTM's hidden state and last prediction to generate attention weight using a FC layer. 
4. this attention weight will be used to weigh the encoder_outputs using batch matric multiplication. This will give us a NEW view on how to look at encoder_states.
5. this attention applied encoder_states will then be concatenated with the input, and then sent a linear layer and _then_ sent to the LSTM. 
6. LSTM's output will be sent to a FC layer to predict one of the output_language words

Let's prepare all the inputs we need to do this

In [164]:
decoder_input = torch.tensor([[SOS_token]],device = device) #First input to the decoder will be SOS_token
decoder_hidden,decoder_cell = encoder_hidden,encoder_cell #decoder/LSTM's hidden state will be initialized with the encoder's last hidden state,similarly cell state is also initiated
decoder_hidden.shape,decoder_input.shape


(torch.Size([1, 1, 256]), torch.Size([1, 1]))

Let's create embedding layer

In [165]:
output_size = output_lang.n_words
embedding = nn. Embedding(output_size,256).to(device)
embedded = embedding(decoder_input)
embedded.shape,output_size

(torch.Size([1, 1, 256]), 2803)

we will use LSTM's hidden state and last prediction to generate attention weight using a FC layer.

In [166]:
attn_weights_layer = nn.Linear(256 * 2, MAX_LENGTH).to(device)#since input to attention weights will be concatenated tensor of previous hidden state and output
decoder_hidden.shape,embedded.shape# since previous output is current input


(torch.Size([1, 1, 256]), torch.Size([1, 1, 256]))

In [167]:
decoder_hidden[0].shape

torch.Size([1, 256])

In [168]:
torch.cat((decoder_hidden[0],embedded[0]), dim =1).shape

torch.Size([1, 512])

In [169]:
attn_weights = attn_weights_layer(torch.cat((decoder_hidden[0],embedded[0]), dim =1))

attn_weights.shape,attn_weights

(torch.Size([1, 10]),
 tensor([[ 0.1070,  0.7801,  0.4179, -0.5705, -0.1452, -0.0084,  0.1125, -0.2949,
          -0.4868,  0.6633]], device='cuda:0', grad_fn=<AddmmBackward>))

Will take softmax of these weights

In [170]:
attn_weights = F.softmax(attn_weights, dim = 1)
attn_weights

tensor([[0.0955, 0.1872, 0.1303, 0.0485, 0.0742, 0.0851, 0.0960, 0.0639, 0.0527,
         0.1666]], device='cuda:0', grad_fn=<SoftmaxBackward>)

Next , let's apply attention weights to encoder outputs to see which word has to concentrate during each time step.Before that let's reshape attn_weights and encoder_output as matrix multiplication will accept only 3D tensors as input

In [171]:
attn_weights.shape,encoder_outputs.shape

(torch.Size([1, 10]), torch.Size([10, 256]))

In [172]:
attn_weights.unsqueeze(0).shape,encoder_outputs.unsqueeze(0).shape

(torch.Size([1, 1, 10]), torch.Size([1, 10, 256]))

In [173]:
attn_applied = torch.bmm(attn_weights.unsqueeze(0),encoder_outputs.unsqueeze(0))
attn_applied.shape

torch.Size([1, 1, 256])

So, now we have this 256dm attn_applied encoder_outputs capturing what we should focus on on this step. We also have the input we already generated. That's 256dm again. LSTM is gonna take 256 only. So we need to concatenate them, send to a linear layer to reduce dimensions, and then send to LSTM

In [174]:
input_layer_lstm = nn.Linear(256 * 2 , 256).to(device)
input_to_lstm = input_layer_lstm(torch.cat((embedded[0], attn_applied[0]),dim = 1))
input_to_lstm = input_to_lstm.unsqueeze(0)
decoder_hidden.shape, input_to_lstm.shape


(torch.Size([1, 1, 256]), torch.Size([1, 1, 256]))

Now let's build our decoder LSTM

In [175]:
lstm = nn.LSTM(256,256).to(device)
decoder_output,(decoder_hidden , decoder_cell) = lstm( input_to_lstm, (decoder_hidden , decoder_cell))
decoder_output.shape,decoder_hidden.shape

(torch.Size([1, 1, 256]), torch.Size([1, 1, 256]))

In [176]:
import torch.nn.functional as F
output_word_layer = nn.Linear(256, output_lang.n_words).to(device)

In [177]:
output = F.relu(decoder_output)
output = F.softmax(output_word_layer(output[0]), dim = 1)
output, output.shape

(tensor([[0.0004, 0.0003, 0.0004,  ..., 0.0004, 0.0004, 0.0003]],
        device='cuda:0', grad_fn=<SoftmaxBackward>), torch.Size([1, 2803]))

In [178]:
output.data.topk(1)

torch.return_types.topk(values=tensor([[0.0004]], device='cuda:0'), indices=tensor([[1881]], device='cuda:0'))

Says highest value is 0.0004 and it is present at idx 1301

In [179]:
top_value, top_id =output.data.topk(1)
top_word = output_lang.index2word[top_id.item()]
top_value.item(), top_id.item(), top_word

(0.0004146175633650273, 1881, 'or')

 Let's combine all of these steps, we we have just processed 1 input till now.

In [180]:
decoder_input = torch.tensor([[SOS_token]],device = device) #First input to the decoder will be SOS_token
decoder_hidden,decoder_cell = encoder_hidden,encoder_cell
output_size = output_lang.n_words
embedding = nn.Embedding(output_size, 256).to(device)
embedded = embedding(decoder_input)
attn_weight_layer = nn.Linear(256 * 2, MAX_LENGTH).to(device)
attn_weights = attn_weight_layer(torch.cat((embedded[0], decoder_hidden[0]), 1))
attn_weights = F.softmax(attn_weights, dim = 1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

input_layer_lstm = nn.Linear(256 * 2 , 256).to(device)
input_to_lstm = input_layer_lstm(torch.cat((embedded[0], attn_applied[0]),dim = 1))
input_to_lstm = input_to_lstm.unsqueeze(0)
lstm = nn.LSTM(256,256).to(device)
decoder_output,(decoder_hidden , decoder_cell) = lstm( input_to_lstm, (decoder_hidden , decoder_cell))
output_word_layer = nn.Linear(256, output_lang.n_words).to(device)
output = F.relu(decoder_output)
output = F.softmax(output_word_layer(output[0]), dim = 1)
top_value, top_id= output.data.topk(1)
output_lang.index2word[top_id.item()]

'he'

In [181]:
embedding = nn.Embedding(output_size, 256).to(device)
attn_weight_layer = nn.Linear(256 * 2, 10).to(device)
input_layer_lstm = nn.Linear(256 * 2 , 256).to(device)
lstm = nn.LSTM(256,256).to(device)
output_word_layer = nn.Linear(256, output_lang.n_words).to(device)

decoder_input = torch.tensor([[SOS_token]],device = device) #First input to the decoder will be SOS_token
decoder_hidden,decoder_cell = encoder_hidden,encoder_cell
output_size = output_lang.n_words
embedded = embedding(decoder_input)
attn_weights = attn_weight_layer(torch.cat((embedded[0], decoder_hidden[0]), 1))
attn_weights = F.softmax(attn_weights, dim = 1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

input_to_lstm = input_layer_lstm(torch.cat((embedded[0], attn_applied[0]),dim = 1))
input_to_lstm = input_to_lstm.unsqueeze(0)
decoder_output,(decoder_hidden , decoder_cell) = lstm( input_to_lstm, (decoder_hidden , decoder_cell))
output = F.relu(decoder_output)
output = F.softmax(output_word_layer(output[0]), dim = 1)
top_value, top_id= output.data.topk(1)
output_lang.index2word[top_id.item()], attn_weights

('war',
 tensor([[0.0684, 0.1058, 0.0787, 0.0922, 0.0985, 0.1298, 0.1720, 0.0598, 0.0686,
          0.1262]], device='cuda:0', grad_fn=<SoftmaxBackward>))

We will past last predicted word as input here , No teacher forcing

In [182]:

decoder_input = torch.tensor([[top_id.item()]],device = device) #First input to the decoder will be SOS_token
decoder_hidden,decoder_cell = encoder_hidden,encoder_cell
output_size = output_lang.n_words
embedded = embedding(decoder_input)
attn_weights = attn_weight_layer(torch.cat((embedded[0], decoder_hidden[0]), 1))
attn_weights = F.softmax(attn_weights, dim = 1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

input_to_lstm = input_layer_lstm(torch.cat((embedded[0], attn_applied[0]),dim = 1))
input_to_lstm = input_to_lstm.unsqueeze(0)
decoder_output,(decoder_hidden , decoder_cell) = lstm( input_to_lstm, (decoder_hidden , decoder_cell))
output = F.relu(decoder_output)
output = F.softmax(output_word_layer(output[0]), dim = 1)
top_value, top_id= output.data.topk(1)
output_lang.index2word[top_id.item()], attn_weights

('acrobat',
 tensor([[0.1060, 0.1252, 0.1086, 0.0914, 0.0873, 0.1351, 0.0602, 0.0621, 0.0535,
          0.1706]], device='cuda:0', grad_fn=<SoftmaxBackward>))

In [183]:

decoder_input = torch.tensor([[top_id.item()]],device = device) 
decoder_hidden,decoder_cell = encoder_hidden,encoder_cell
output_size = output_lang.n_words
embedded = embedding(decoder_input)
attn_weights = attn_weight_layer(torch.cat((embedded[0], decoder_hidden[0]), 1))
attn_weights = F.softmax(attn_weights, dim = 1)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

input_to_lstm = input_layer_lstm(torch.cat((embedded[0], attn_applied[0]),dim = 1))
input_to_lstm = input_to_lstm.unsqueeze(0)
decoder_output,(decoder_hidden , decoder_cell) = lstm( input_to_lstm, (decoder_hidden , decoder_cell))
output = F.relu(decoder_output)
output = F.softmax(output_word_layer(output[0]), dim = 1)
top_value, top_id= output.data.topk(1)
output_lang.index2word[top_id.item()], attn_weights

('easily',
 tensor([[0.0852, 0.0970, 0.0521, 0.1161, 0.1062, 0.1120, 0.0884, 0.0575, 0.0976,
          0.1878]], device='cuda:0', grad_fn=<SoftmaxBackward>))

Now let's apply full teacher forcing, we will be sending output indices as inputs to decoder

In [184]:
pred = []
pred_idx = []
for i in range(4):
  decoder_input = torch.tensor([[output_indices[i]]],device = device) 
  decoder_hidden,decoder_cell = encoder_hidden,encoder_cell
  output_size = output_lang.n_words
  embedded = embedding(decoder_input)
  attn_weights = attn_weight_layer(torch.cat((embedded[0], decoder_hidden[0]), 1))
  attn_weights = F.softmax(attn_weights, dim = 1)
  attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

  input_to_lstm = input_layer_lstm(torch.cat((embedded[0], attn_applied[0]),dim = 1))
  input_to_lstm = input_to_lstm.unsqueeze(0)
  decoder_output,(decoder_hidden , decoder_cell) = lstm( input_to_lstm, (decoder_hidden , decoder_cell))
  output = F.relu(decoder_output)
  output = F.softmax(output_word_layer(output[0]), dim = 1)
  top_value, top_id= output.data.topk(1)
  output_lang.index2word[top_id.item()], attn_weights
  pred.append(output_lang.index2word[top_id.item()])
  pred_idx.append(top_id.item())
  print('\033[1m' +"iteration {}  \033[0m".format(i))
  print('Actual output word = {}'.format(output_sentence.split(" ")[i]))
  print('Predicted output word = {}'.format(output_lang.index2word[top_id.item()]))
  print('Actual output index = {}'.format(output_indices[i]))
  print('Predicted output word = {}'.format( top_id.item()))
  print("attention weights are as follows , {}".format(attn_weights))
  print("-----------------------------------------------------")

[1miteration 0  [0m
Actual output word = i
Predicted output word = shizuoka
Actual output index = 2
Predicted output word = 1020
attention weights are as follows , tensor([[0.0839, 0.0805, 0.1422, 0.1088, 0.0913, 0.0734, 0.0709, 0.1302, 0.1144,
         0.1044]], device='cuda:0', grad_fn=<SoftmaxBackward>)
-----------------------------------------------------
[1miteration 1  [0m
Actual output word = m
Predicted output word = husband
Actual output index = 3
Predicted output word = 586
attention weights are as follows , tensor([[0.0389, 0.1585, 0.0543, 0.0873, 0.1292, 0.1652, 0.1287, 0.0957, 0.0567,
         0.0854]], device='cuda:0', grad_fn=<SoftmaxBackward>)
-----------------------------------------------------
[1miteration 2  [0m
Actual output word = not
Predicted output word = they
Actual output index = 147
Predicted output word = 221
attention weights are as follows , tensor([[0.0537, 0.1014, 0.0668, 0.0881, 0.1242, 0.0896, 0.0825, 0.1809, 0.0720,
         0.1408]], device='c

In [185]:
  print('Actual input sentence = {}'.format(input_sentence))
  print('Actual output sentence = {}'.format(output_sentence))
  print('Actual output indices = {}'.format(output_indices))
  print('Predicted output sentence = {}'.format(" ".join(pred)))
  print('Predicted output indices = {}'.format(pred_idx))
  

Actual input sentence = je ne vais pas jouer a ce jeu .
Actual output sentence = i m not going to play this game .
Actual output indices = [2, 3, 147, 61, 532, 2070, 797, 1519, 4, 1]
Predicted output sentence = shizuoka husband they apple
Predicted output indices = [1020, 586, 221, 1188]
