## Building a Chatbot: PyTorch

Note: This code has been adapted from https://pytorch.org/tutorials/beginner/chatbot_tutorial.html. We walk through it in more detail and inspect the output step by step to build a deeper understanding.

A video explanation is available at: https://www.youtube.com/watch?v=CNuI8OWsppg

In [135]:
import torch
import torch.nn as nn                   # neural-network package
from torch import optim                 # optimizers
import torch.nn.functional as F         # functional API (e.g. ReLU and other stateless operations)
import csv
import random
import re                               # regular expression library
import os
import unicodedata
import codecs
import itertools

In [136]:
CUDA = torch.cuda.is_available()   # True or False
device = torch.device("cuda" if CUDA else "cpu")

## Part 1: Data Preprocessing

Download data from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

In [137]:
lines_filepath = os.path.join("cornell movie-dialogs corpus","movie_lines.txt")   #("folder_name", "filename.txt")
conv_filepath = os.path.join("cornell movie-dialogs corpus","movie_conversations.txt")

In [138]:
# Visualize some lines
with open(lines_filepath, 'r', encoding='iso-8859-1') as file:
    lines = file.readlines()
for line in lines[:8]:    # loop over the first 8 lines
    print(line.strip())   # print each line without the trailing newline

# LineID - CharacterID - MovieID - CharacterName - Speech of Character

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.
L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No


## Processing DataSet - Part 1

In [139]:
# Splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
line_fields = ["lineID", "characterID", "movieID", "character", "text"]
lines = {}  # empty dictionary
with open(lines_filepath, 'r', encoding = 'iso-8859-1') as f:
    for line in f:
        values = line.split(" +++$+++ ")
        # Extract fields
        lineObj = {}   #temporary dictionary that resets at each loop (each line of the text)
        for i, field in enumerate(line_fields):
            lineObj[field] = values[i]
        lines[lineObj['lineID']] = lineObj
        
# lines now maps each lineID to a dictionary holding that line's fields
lines

{'L1045': {'lineID': 'L1045',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'They do not!\n'},
 'L1044': {'lineID': 'L1044',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'They do to!\n'},
 'L985': {'lineID': 'L985',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'I hope so.\n'},
 'L984': {'lineID': 'L984',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'She okay?\n'},
 'L925': {'lineID': 'L925',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': "Let's go.\n"},
 'L924': {'lineID': 'L924',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'Wow\n'},
 'L872': {'lineID': 'L872',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': "Okay -- you're gonna need to learn how to lie.\n"},
 'L871': {'lineID': 'L871',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'No

In [140]:
# Illustrate the enumerate
line_fields = ["lineID", "characterID", "movieID", "character", "text"]
for i, field in enumerate(line_fields):
    print(i,field)


0 lineID
1 characterID
2 movieID
3 character
4 text
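The field-by-field loop with `enumerate` can also be written with `dict(zip(...))`. The sketch below parses a single line copied from the sample output shown earlier. The helper name `parse_line` and the constants are hypothetical, not part of the original notebook, and unlike the notebook's loop this version strips the trailing newline from the text.

```python
# Hypothetical helper: parse one raw corpus line into a field dictionary.
LINE_FIELDS = ["lineID", "characterID", "movieID", "character", "text"]
SEPARATOR = " +++$+++ "

def parse_line(raw_line):
    # zip pairs each field name with the value at the same position
    values = raw_line.rstrip("\n").split(SEPARATOR)
    return dict(zip(LINE_FIELDS, values))

sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n"
parsed = parse_line(sample)
print(parsed["character"], "->", parsed["text"])
```

Both approaches build the same kind of dictionary; `zip` simply removes the explicit index variable.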


## Processing DataSet - Part 2
### movie_conversations.txt


In [141]:
# Group the line dictionaries built above into conversations, based on "movie_conversations.txt"


# characterID1 - first character in the conversation
# characterID2 - second character in the conversation
# movieID - movie ID
# utteranceIDs - IDs of the lines exchanged between the two characters

conv_fields = ["characterID1", "characterID2", "movieID", "utteranceIDs"]
conversations = []
with open(conv_filepath, 'r', encoding = "iso-8859-1") as f:
    for line in f:
        values = line.split(" +++$+++ ")
        # Extract fields
        convObj = {}   # empty dictionary
        for i, field in enumerate(conv_fields):
            convObj[field] = values[i]
        # convObj["utteranceIDs"] is a string such as "['L598485', 'L598486', ...]\n";
        # eval turns it into a real list (acceptable here because the corpus file is trusted)
        lineIDs = eval(convObj["utteranceIDs"])
        
        # Reassemble lines
        convObj["lines"] = []
        for lineID in lineIDs:
            convObj["lines"].append(lines[lineID])
        conversations.append(convObj)
    

In [142]:
conversations[0]

{'characterID1': 'u0',
 'characterID2': 'u2',
 'movieID': 'm0',
 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n",
 'lines': [{'lineID': 'L194',
   'characterID': 'u0',
   'movieID': 'm0',
   'character': 'BIANCA',
   'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
  {'lineID': 'L195',
   'characterID': 'u2',
   'movieID': 'm0',
   'character': 'CAMERON',
   'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
  {'lineID': 'L196',
   'characterID': 'u0',
   'movieID': 'm0',
   'character': 'BIANCA',
   'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
  {'lineID': 'L197',
   'characterID': 'u2',
   'movieID': 'm0',
   'character': 'CAMERON',
   'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}]}
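The `utteranceIDs` field is stored as a string that merely looks like a list, which is why the loading cell has to convert it. As a hedged alternative to `eval`, the sketch below uses the standard library's `ast.literal_eval`, which only accepts Python literals. The sample string is copied from the output above; this is an illustration, not a change to the notebook's code.

```python
import ast

# The last field of a movie_conversations.txt row is a string, not a list.
raw_field = "['L194', 'L195', 'L196', 'L197']\n"

# literal_eval parses literals only, so no arbitrary code can be executed.
line_ids = ast.literal_eval(raw_field.strip())
print(line_ids)
print(type(line_ids).__name__, len(line_ids))
```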

## Processing DataSet - Part 3

Next we pair each line with its reply so the data can be fed to the encoder and decoder.

We transform the dataset into question-answer form: every line in a conversation is treated as a question, the line that follows it is the reply, and each such pair is stored in a list.

In [143]:
# Get the list of line dictionaries of the first conversation
conversations[0]["lines"]

[{'lineID': 'L194',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
 {'lineID': 'L195',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
 {'lineID': 'L196',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
 {'lineID': 'L197',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}]

In [144]:
# Check how many lines the first conversation has
len(conversations[0]["lines"])

4

In [145]:
# Extract the first line of the first conversation
conversations[0]["lines"][0]

{'lineID': 'L194',
 'characterID': 'u0',
 'movieID': 'm0',
 'character': 'BIANCA',
 'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'}

In [146]:
# Extract the text of the first line of the first conversation
conversations[0]["lines"][0]["text"]

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'

In [147]:
conversations[0]["lines"][0]["text"].strip() # strip() removes the trailing \n, leaving only the text

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.'

In [148]:
# Extract pairs of sentences from conversation
qa_pairs = []

for conversation in conversations:
    # Iterate over all the lines of conversation
    for i in range(len(conversation["lines"])-1):
        inputLine = conversation["lines"][i]["text"].strip()
        targetLine = conversation["lines"][i+1]["text"].strip()
        # Filter out bad samples (if either string is empty)
        if inputLine and targetLine:
            qa_pairs.append([inputLine, targetLine])

In [149]:
qa_pairs[1]

["Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.']
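The pairing loop can be illustrated on a toy conversation. The sketch below is a compact equivalent that uses `zip` to walk over consecutive lines; the sample texts are made up for illustration, and the third one is deliberately blank to show the empty-string filter at work.

```python
# Toy conversation standing in for the "text" values of conversation["lines"].
texts = ["Can we make this quick?\n", "Well, I thought we'd start.\n", "  \n", "Okay.\n"]

cleaned = [t.strip() for t in texts]
# zip(cleaned, cleaned[1:]) yields each line together with the line that follows it
toy_pairs = [[q, a] for q, a in zip(cleaned, cleaned[1:]) if q and a]
print(toy_pairs)
```

Only the first pair survives, because every other pair touches the blank line.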

## Processing DataSet - Part 4

Save the question-answer pairs to a text file so the preprocessing does not have to be repeated.

In [150]:
# Define path to new file
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
delimiter =  "\t" # the pair question-answer will be delimited by TAB
#Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode-escape"))

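# Write new csv file
# Note: csv.writer ends rows with "\r\n" by default. Opening the file without newline=''
# lets Windows translate that to "\r\r\n", which is what the output below shows.
# The empty pairs this produces are filtered out later.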
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding = "utf-8") as outputfile:
    writer = csv.writer(outputfile, delimiter = delimiter)
    for pair in qa_pairs:
        writer.writerow(pair)
print("Done writing to file")


Writing newly formatted file...
Done writing to file


In [151]:
# Visualize some lines
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
with open(datafile, 'rb') as file:
    lines = file.readlines()

for line in lines[:8]:
    print(line)

b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\r\r\n"
b'Why?\tU
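The `\r\r\n` line endings in the output above are most likely a side effect of how the file was written: `csv.writer` terminates rows with `\r\n` by default, and a text-mode file on Windows translates the `\n` again. The sketch below shows the writer's behaviour in memory with `io.StringIO`, which performs no newline translation by default; the sample pair is made up.

```python
import csv
import io

pair = ["Why?", "Because."]

# Default behaviour: rows end with "\r\n"
default_buf = io.StringIO()
csv.writer(default_buf, delimiter="\t").writerow(pair)

# Explicit terminator: rows end with a single "\n"
unix_buf = io.StringIO()
csv.writer(unix_buf, delimiter="\t", lineterminator="\n").writerow(pair)

print(repr(default_buf.getvalue()))
print(repr(unix_buf.getvalue()))
```

When writing to a real file, passing `newline=''` to `open`, as the `csv` documentation recommends, or setting `lineterminator="\n"` should avoid the doubled carriage return.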

## Processing the Words

In [152]:
PAD_token = 0 # Used for padding short sentences
SOS_token = 1 # Start-of-sentence token <START>
EOS_token = 2 # End-of-sentence token <END>

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.word2index = {}   # maps word -> index (e.g. car -> 10, road -> 34)
        self.word2count = {}   # maps word -> frequency
        self.index2word = {PAD_token: "PAD", SOS_token:"SOS", EOS_token:"EOS"}
        self.num_words = 3     # start at 3 to account for PAD, SOS, EOS
    
    # Next we define methods
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1
            
    # Remove words below a certain count threshold
    def trim(self, min_count):
        keep_words = []
        for k, v in self.word2count.items():   # loop for each key and values
            if v >= min_count:
                keep_words.append(k)
        
        print("keep_words {} / {} = {:.4f}".format(len(keep_words), 
                                                   len(self.word2index), 
                                                   len(keep_words) / len(self.word2index)))
        # reinitialize dictionaries
        self.word2index = {}   # maps word -> index
        self.word2count = {}   # maps word -> frequency
        self.index2word = {PAD_token: "PAD", SOS_token:"SOS", EOS_token:"EOS"}
        self.num_words = 3     # start at 3 to account for PAD, SOS, EOS
        
        # re-add only the words whose count is >= min_count
        for word in keep_words:
            self.addWord(word)
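The counting and trimming logic of the class can be sketched in a few lines with `collections.Counter`. This is a simplified stand-in for `Vocabulary`, not a replacement: the sentences and the threshold `MIN_COUNT_DEMO` are made up for illustration.

```python
from collections import Counter

sentences = ["the car is red", "the road is long", "the car is fast"]

# Count how often each word occurs (the role of word2count)
word_counts = Counter(word for s in sentences for word in s.split(" "))

MIN_COUNT_DEMO = 2  # hypothetical threshold for this toy example
kept = [w for w, c in word_counts.items() if c >= MIN_COUNT_DEMO]

# Rebuild the index, reserving 0, 1, 2 for PAD, SOS, EOS
word2index = {w: i for i, w in enumerate(kept, start=3)}
print(word_counts)
print(word2index)
```

As in `trim`, the surviving words are re-indexed from 3 upward, so indexes assigned before trimming are no longer valid afterwards.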

## Preprocessing: Remove punctuation and symbols

In [153]:
# Turn a Unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD',s) if unicodedata.category(c) != 'Mn')

# Mn = "Mark, Nonspacing": combining marks such as accents, which NFD splits off from their base letter

In [154]:
# testing the function unicodeToAscii
unicodeToAscii("Montréal, Françoise....")

'Montreal, Francoise....'
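To see why filtering on the `Mn` category removes accents, it helps to look at what NFD normalization produces for a single accented letter. The sketch below uses only the standard `unicodedata` module; the character is written as an escape (`\u00e9`, the precomposed "é") so the example does not depend on the file's encoding.

```python
import unicodedata

decomposed = unicodedata.normalize("NFD", "\u00e9")  # precomposed e with acute accent

# After NFD the letter is two code points: the base letter and a combining accent
for ch in decomposed:
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))

# Dropping every "Mn" code point leaves only the base letter
stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(stripped)
```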

In [155]:
# Lowercase, trim whitespace, and remove non-letter characters.
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # Insert a space before each '.', '!' or '?' so punctuation becomes its own token: 'hi!' -> 'hi !'
    # \1 refers to the first capture group, ([.!?]); the r prefix makes a raw string,
    # so the backslash reaches the regex engine unchanged
    # re = regular expression module
    s = re.sub(r"([.!?])", r" \1", s)
    # Replace every run of characters that are not letters or .!? with a single space ('+' means one or more)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    # Collapse runs of whitespace into one space and strip leading and trailing spaces
    s = re.sub(r"\s+", r" ", s).strip()
    return s 

In [156]:
# testing the function
normalizeString("aa123aa!s's   dd?")

'aa aa !s s dd ?'
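The effect of each substitution is easier to follow when the intermediate strings are printed. The sketch below repeats the same three regular expressions step by step on a made-up input; it skips the `unicodeToAscii` call because the sample contains no accented characters.

```python
import re

s = "Hello,  World!"

step1 = s.lower().strip()                      # lowercase and trim
step2 = re.sub(r"([.!?])", r" \1", step1)      # put a space before . ! ?
step3 = re.sub(r"[^a-zA-Z.!?]+", " ", step2)   # non-letters become one space
step4 = re.sub(r"\s+", " ", step3).strip()     # collapse whitespace

for label, value in [("step1", step1), ("step2", step2), ("step3", step3), ("step4", step4)]:
    print(label, repr(value))
```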

## Processing text

In [157]:
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
# read the file and split into lines
print("Reading and processing file....Please Wait")
lines = open(datafile, 'r', encoding = "utf-8").read().strip().split('\n')
# Split every line into pairs and normalize
pairs = [[normalizeString(s) for s in pair.split('\t')] for pair in lines]
print("Done reading !")
voc = Vocabulary("cornell movie-dialogs corpus")

Reading and processing file....Please Wait
Done reading !


Explanation of the list comprehension:

    pairs = [[normalizeString(s) for s in pair.split('\t')] for pair in lines]

In [158]:
lines[0].split('\t')

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you."]

In [159]:
normalizeString(lines[0].split('\t')[0])

'can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .'

In [160]:
normalizeString(lines[0].split('\t')[1])

'well i thought we d start with pronunciation if that s okay with you .'

In [161]:
pairs[0]

['can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .',
 'well i thought we d start with pronunciation if that s okay with you .']

### Limit the length of sentences 

In [162]:
# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold
MAX_LENGTH = 10 # maximum sentence length to consider (in words)
def filterPair(p):  # each pair is a list of two elements
    # Use strict '<' so there is room left for the EOS token appended later
    return len(p[0].split()) < MAX_LENGTH and len(p[1].split()) < MAX_LENGTH

# len(pairs[0][0].split())
# len(pairs[0][1].split())

# Keep only the pairs that satisfy filterPair
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]
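A quick check makes the boundary of the length filter concrete: a sentence of nine words passes and a sentence of ten words does not. The sketch below re-implements the condition under the hypothetical name `filter_pair` so that it runs on its own; the sentences are placeholders.

```python
MAX_LENGTH = 10  # same threshold as in the notebook

def filter_pair(p):  # hypothetical stand-alone copy of filterPair
    return len(p[0].split()) < MAX_LENGTH and len(p[1].split()) < MAX_LENGTH

nine_words = "a b c d e f g h i"
ten_words = "a b c d e f g h i j"

keeps_nine = filter_pair([nine_words, "ok ."])
keeps_ten = filter_pair([ten_words, "ok ."])
print(keeps_nine, keeps_ten)
```

With nine words plus the EOS token added later, the longest index sequence has exactly ten entries.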

In [163]:
pairs

[['can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .',
  'well i thought we d start with pronunciation if that s okay with you .'],
 [''],
 ['well i thought we d start with pronunciation if that s okay with you .',
  'not the hacking and gagging and spitting part . please .'],
 [''],
 ['not the hacking and gagging and spitting part . please .',
  'okay . . . then how bout we try out some french cuisine . saturday ? night ?'],
 [''],
 ['you re asking me out . that s so cute . what s your name again ?',
  'forget it .'],
 [''],
 ['no no it s my fault we didn t have a proper introduction', 'cameron .'],
 [''],
 ['cameron .',
  'the thing is cameron i m at the mercy of a particularly hideous breed of loser . my sister . i can t date until she does .'],
 [''],
 ['the thing is cameron i m at the mercy of a particularly hideous breed of loser . my sister . i can t date until she does .',
  'seems like she coul

In [164]:
# Keep only pairs with more than one element, which skips the [''] entries
pairs = [pair for pair in pairs if len(pair) > 1]
print("There are {} pairs/conversations in the dataset".format(len(pairs)))
pairs = filterPairs(pairs)
print("After filtering, there are {} pairs/conversations".format(len(pairs)))

There are 221282 pairs/conversations in the dataset
After filtering, there are 64271 pairs/conversations


## Getting Rid of Rare Words

In [165]:
# Loop through each pair and add the question and reply sentences to the vocabulary
for pair in pairs:
    voc.addSentence(pair[0])
    voc.addSentence(pair[1])
print("Counted words:", voc.num_words)

Counted words: 18008


In [166]:
for pair in pairs[:10]:
    print(pair)

['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', 'the real you .']


### Trim rare words
If a word occurs fewer than 3 times, it is removed from the vocabulary, and every pair that contains such a word is dropped.

In [167]:
MIN_COUNT = 3  #Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used fewer than MIN_COUNT times from the voc
    voc.trim(MIN_COUNT)  # calls the trim method defined earlier in the Vocabulary class
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sequence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:  #the word2index contains the filtered words
                keep_input = False
                break
        # Check output sequence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break
        
        # Only keep pairs that do not contain trimmed word(s) in their input or output sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)
    
    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs),
                                                               len(keep_pairs),
                                                               len(keep_pairs)/len(pairs)))
    return keep_pairs

In [168]:
# Trim voc and pairs
pairs = trimRareWords(voc,pairs,MIN_COUNT)

keep_words 7823 / 18005 = 0.4345
Trimmed from 64271 pairs to 53165, 0.8272 of total


# Prepare Data for Models

Although we have put a great deal of effort into preparing and massaging our data into a nice vocabulary object and list of sentence pairs, our models will ultimately **expect numerical torch tensors as inputs.** One way to prepare the processed data for the models can be found in the **[seq2seq translation tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).** In that tutorial, we use a batch size of 1, meaning that all we have to do is convert the words in our sentence pairs to their corresponding indexes from the vocabulary and feed this to the models.

However, if you’re interested in speeding up training and/or would like to leverage **GPU parallelization** capabilities, you will need **to train with mini-batches.**

Using mini-batches also means that **we must be mindful of the variation of sentence length in our batches.** To accommodate sentences of different sizes in the same batch, we will make our batched input tensor of shape (max_length, batch_size), where **sentences shorter than the max_length are zero padded after an EOS_token.**

If we simply convert our English sentences to tensors by converting words to their indexes (`indexesFromSentence`) and zero-pad, our tensor would have shape (batch_size, max_length) and indexing the first dimension would return a full sequence across all time-steps. However, we need to be able to index our batch along time, and across all sequences in the batch. Therefore, **we transpose our input batch shape to (max_length, batch_size), so that indexing across the first dimension returns a time step across all sentences in the batch.** We handle this transpose implicitly in the `zeropadding` function.
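The difference between the two layouts can be sketched with plain Python lists before any tensors are involved. The index sequences below are made up (each ends in 2, the EOS index, and 0 is the padding index); `itertools.zip_longest` pads and transposes in one step.

```python
import itertools

PAD = 0
batch = [[7, 8, 9, 2], [3, 4, 2], [5, 2]]  # hypothetical index sequences

max_len = max(len(seq) for seq in batch)

# (batch_size, max_length): indexing the first dimension gives one whole sentence
batch_major = [seq + [PAD] * (max_len - len(seq)) for seq in batch]

# (max_length, batch_size): indexing the first dimension gives one time step
time_major = [list(step) for step in itertools.zip_longest(*batch, fillvalue=PAD)]

print(batch_major[0])  # first sentence
print(time_major[0])   # first token of every sentence
```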

**Figure below**

Each word is represented by an index; in other words, each number stands for a different word. 

Before transposing, the matrix has _batch_size_ rows and _max_length_ columns. We set the maximum sentence length to 10, so there are at most 10 columns. 

We transpose the matrix because we process it in batches. After transposing, each row holds one time step across all sentences in the batch, so we feed one row at a time to the recurrent network (a GRU in this tutorial).

![](gpu_batch.png)

In [172]:
# Convert a sentence into a list of word indexes, followed by EOS_token
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

In [170]:
pairs[1][0]

'you have my word . as a gentleman'

In [174]:
# test the function
indexesFromSentence(voc,pairs[1][0])

# Note that the list ends with index 2, which is EOS_token (end of sentence)
# The '.' inside the sentence has its own index (4 here), not 2

[7, 8, 9, 10, 4, 11, 12, 13, 2]

In [176]:
#Define some samples for testing
inp = []
out = []
for pair in pairs[:10]:   # selecting the first 10 sentences
    inp.append(pair[0])
    out.append(pair[1])
    
print(inp)
print(len(inp))
indexes = [indexesFromSentence(voc,sentence) for sentence in inp]
indexes

['there .', 'you have my word . as a gentleman', 'hi .', 'have fun tonight ?', 'well no . . .', 'then that s all you had to say .', 'but', 'do you listen to this crap ?', 'what good stuff ?', 'wow']
10


[[3, 4, 2],
 [7, 8, 9, 10, 4, 11, 12, 13, 2],
 [16, 4, 2],
 [8, 31, 22, 6, 2],
 [33, 34, 4, 4, 4, 2],
 [35, 36, 37, 38, 7, 39, 40, 41, 4, 2],
 [42, 2],
 [47, 7, 48, 40, 45, 49, 6, 2],
 [50, 51, 52, 6, 2],
 [58, 2]]

## Understanding the Zip Function (zero padding)

We need every sequence in the batch to have the same length, with the sequences ordered by descending length.

In [177]:
# Some extra Python functions that are helpful here

a = ['A','B','C']
b = [1, 2, 3]
list(zip(a,b))

[('A', 1), ('B', 2), ('C', 3)]

In [178]:
# If list "a" has more elements than list "b", zip ignores the extra elements.
# itertools.zip_longest keeps them and fills the missing values with None.
a = ['A','B','C','D','E']
b = [1, 2, 3]
list(itertools.zip_longest(a,b))

[('A', 1), ('B', 2), ('C', 3), ('D', None), ('E', None)]

In [186]:
# Now we use zip_longest to build the padded matrix.
# See what happens when we zip the following nested list.
a = [[3, 4, 2],
 [7, 8, 9, 10, 4, 11, 12, 13, 2],
 [16, 4, 2],
 [8, 31, 22, 6, 2],
 [33, 34, 4, 4, 4, 2],
 [35, 36, 37, 38, 7, 39, 40, 41, 4, 2],
 [42, 2],
 [47, 7, 48, 40, 45, 49, 6, 2],
 [50, 51, 52, 6, 2],
 [58, 2]]

print(list(itertools.zip_longest(*a)))

# Where a list has run out of elements, zip_longest fills the gap with None.
# The first element of the result is the first column of the input: the first token of every sentence.
# The result has as many tuples as the longest inner list has elements (10 here), so the batch is transposed as we want.

print('\nfirst element of the list')
list(itertools.zip_longest(*a))[0]

[(3, 7, 16, 8, 33, 35, 42, 47, 50, 58), (4, 8, 4, 31, 34, 36, 2, 7, 51, 2), (2, 9, 2, 22, 4, 37, None, 48, 52, None), (None, 10, None, 6, 4, 38, None, 40, 6, None), (None, 4, None, 2, 4, 7, None, 45, 2, None), (None, 11, None, None, 2, 39, None, 49, None, None), (None, 12, None, None, None, 40, None, 6, None, None), (None, 13, None, None, None, 41, None, 2, None, None), (None, 2, None, None, None, 4, None, None, None, None), (None, None, None, None, None, 2, None, None, None, None)]

first element of the list


(3, 7, 16, 8, 33, 35, 42, 47, 50, 58)

In [191]:
# Filling the gaps with zero instead of None is easy:
list(itertools.zip_longest(*a, fillvalue = 0))

[(3, 7, 16, 8, 33, 35, 42, 47, 50, 58),
 (4, 8, 4, 31, 34, 36, 2, 7, 51, 2),
 (2, 9, 2, 22, 4, 37, 0, 48, 52, 0),
 (0, 10, 0, 6, 4, 38, 0, 40, 6, 0),
 (0, 4, 0, 2, 4, 7, 0, 45, 2, 0),
 (0, 11, 0, 0, 2, 39, 0, 49, 0, 0),
 (0, 12, 0, 0, 0, 40, 0, 6, 0, 0),
 (0, 13, 0, 0, 0, 41, 0, 2, 0, 0),
 (0, 2, 0, 0, 0, 4, 0, 0, 0, 0),
 (0, 0, 0, 0, 0, 2, 0, 0, 0, 0)]

In [195]:
# Now that we know how zip_longest works, we can proceed:
def zeropadding(l,fillvalue = 0):
    return list(itertools.zip_longest(*l, fillvalue = fillvalue))

In [199]:
#test the function
test_result = zeropadding(indexes)
print('maximum length: ',len(test_result))
test_result

maximum length:  10


[(3, 7, 16, 8, 33, 35, 42, 47, 50, 58),
 (4, 8, 4, 31, 34, 36, 2, 7, 51, 2),
 (2, 9, 2, 22, 4, 37, 0, 48, 52, 0),
 (0, 10, 0, 6, 4, 38, 0, 40, 6, 0),
 (0, 4, 0, 2, 4, 7, 0, 45, 2, 0),
 (0, 11, 0, 0, 2, 39, 0, 49, 0, 0),
 (0, 12, 0, 0, 0, 40, 0, 6, 0, 0),
 (0, 13, 0, 0, 0, 41, 0, 2, 0, 0),
 (0, 2, 0, 0, 0, 4, 0, 0, 0, 0),
 (0, 0, 0, 0, 0, 2, 0, 0, 0, 0)]

In [192]:
leng = [len(ind) for ind in indexes]
max(leng)  # only to check that the maximum length is 10

10

In [193]:
leng

[3, 9, 3, 5, 6, 10, 2, 8, 5, 2]

In [203]:
# Converts the padded index matrix into a binary matrix (1 = real token, 0 = padding)
def binaryMatrix(l, value=0):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == PAD_token:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

In [205]:
binary_result = binaryMatrix(test_result)
binary_result

[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
 [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
 [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
 [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]
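The same mask can be built with a nested list comprehension, which some readers may find easier to scan than the explicit loops. The sketch below is an equivalent formulation on a small made-up time-major batch; it is not the notebook's `binaryMatrix` itself.

```python
PAD_token = 0
# Hypothetical padded batch: 4 time steps, 3 sentences of lengths 3, 4 and 2
padded = [(3, 7, 16), (4, 8, 2), (2, 9, 0), (0, 2, 0)]

# 1 where there is a real token, 0 where there is padding
mask = [[int(token != PAD_token) for token in step] for step in padded]

# Summing the mask counts the real (non-padding) tokens in the batch
real_tokens = sum(sum(row) for row in mask)
print(mask)
print(real_tokens)
```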

In [222]:
# Returns the padded input sequence tensor and a tensor of lengths for each sequence in the batch
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc,sentence) for sentence in l]
    lenghts = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeropadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lenghts

In [223]:
# Returns padded target sequence tensor, padding mask, and target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc,sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeropadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.ByteTensor(mask)  # recent PyTorch versions prefer torch.BoolTensor for masks
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

In [228]:
# Returns all items for a given batch pairs
def batch2TrainData(voc, pair_batch):
    # Sort the pairs by question length, longest first
    # For each pair we take the question x[0], split it on " " and count the words
    # sort() uses that word count as the key; reverse=True gives descending order
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse = True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lenghts = inputVar(input_batch, voc)
    # assert len(inp) == lenghts[0]
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lenghts, output, mask, max_target_len

In [210]:
# Understanding the lambda function
def add(x):
    return x + 1
print(add(2))

add_l = lambda x: x + 1 # the expression after the colon (:) is the return value of the lambda
print(add_l(2))

3
3
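A lambda used as a sort key works the same way as in `batch2TrainData`. The sketch below sorts a few made-up pairs by question length using `sorted`, which returns a new list, whereas `list.sort` (used in the notebook) reorders the list in place.

```python
batch = [
    ["hi .", "hello ."],
    ["do you listen to this ?", "what ?"],
    ["what good stuff ?", "the real you ."],
]

# The key function returns the number of words in the question (pair[0])
by_length = sorted(batch, key=lambda pair: len(pair[0].split(" ")), reverse=True)

question_lengths = [len(pair[0].split(" ")) for pair in by_length]
print(question_lengths)
print(batch[0][0])  # the original list keeps its order
```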


In [240]:
# Test the Function
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lenghts, target_variable, mask, max_target_len = batches

print("input_variable:")
print(input_variable)
print("lenghts: ", lenghts)
print("target_variable: ")
print(target_variable)
print("mask:")
print(mask)
print("max_target_len:", max_target_len)

input_variable:
tensor([[ 190,   25,   95,   75, 3149],
        [  51,  279,    7,   50,    2],
        [  98,    7,   89,    6,    0],
        [  12,   25, 1095,    2,    0],
        [ 180, 2154,   76,    0,    0],
        [2730,  169,   66,    0,    0],
        [4780,  174,    2,    0,    0],
        [  23,    4,    0,    0,    0],
        [   6,    2,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
lenghts:  tensor([10,  9,  7,  4,  2])
target_variable: 
tensor([[  79,   34, 1095,   75,  598],
        [   4,   67,   50,    5, 3149],
        [   2,   25,    6,    7,   66],
        [   0,  260,    2,   14,    2],
        [   0,    8,    0,  144,    0],
        [   0,    7,    0,    4,    0],
        [   0,    7,    0,    2,    0],
        [   0,   24,    0,    0,    0],
        [   0,    2,    0,    0,    0]])
mask:
tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [0, 1, 1, 1, 1],
        [0, 1, 0, 1, 0],
        [0, 1, 0, 1, 0],
        [

# Defining Models

## seq2seq Model

The brains of our chatbot is a sequence-to-sequence (seq2seq) model. The goal of a seq2seq model is to take a variable-length sequence as an input, and return a variable-length sequence as an output using a fixed-sized model.

Sutskever et al. discovered that by **using two separate recurrent neural nets together**, we can accomplish this task. **One RNN acts as an encoder**, which encodes a variable length input sequence to a fixed-length context vector. In theory, this context vector (the final hidden layer of the RNN) will contain semantic information about the query sentence that is input to the bot. **The second RNN is a decoder**, which takes an input word and the context vector, and returns a guess for the next word in the sequence and a hidden state to use in the next iteration.

![](enc_dec.png)

The steps of the encoder's forward pass are:

        1. Convert word indexes to embeddings.
        2. Pack padded batch of sequences for RNN module.
        3. Forward pass through GRU.
        4. Unpack padding.
        5. Sum bidirectional GRU outputs.
        6. Return output and final hidden state.
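Step 2 deserves a short illustration. Packing lets the RNN skip padded positions by recording how many sequences are still active at each time step. The pure-Python sketch below computes those per-step counts from the sequence lengths; `batch_sizes_from_lengths` is a hypothetical helper written for this explanation, not a PyTorch function, and the lengths are taken from the sample batch printed earlier.

```python
def batch_sizes_from_lengths(lengths):
    # lengths must be sorted in descending order, as batch2TrainData ensures
    max_len = lengths[0]
    # at time step t, a sequence is still active if it is longer than t
    return [sum(1 for n in lengths if n > t) for t in range(max_len)]

lengths = [10, 9, 7, 4, 2]
sizes = batch_sizes_from_lengths(lengths)
padded_slots = len(lengths) * lengths[0]

print(sizes)
print(sum(sizes), "real tokens instead of", padded_slots, "padded slots")
```

In this example the GRU would process 32 positions rather than 50, which is the saving that packing is meant to provide.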

We will use a bidirectional variant of the GRU, meaning that there are essentially two independent RNNs: one that is fed the input sequence in normal sequential order, and one that is fed the input sequence in reverse order. The outputs of each network are summed at each time step. Using a bidirectional GRU will give us the advantage of encoding both past and future context.
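Summing the two directions can be pictured with a plain list. A bidirectional GRU returns, for each time step and sentence, the forward features followed by the backward features along the last dimension, so the vector has length 2 × hidden_size. The sketch below uses small integers and a made-up `hidden_size` of 3 in place of real tensors.

```python
hidden_size = 3  # hypothetical small size for illustration

# One time step of one sentence: [forward features | backward features]
concat_output = [1, 2, 3, 10, 20, 30]

forward_part = concat_output[:hidden_size]
backward_part = concat_output[hidden_size:]

# Element-wise sum brings the vector back to length hidden_size
summed = [f + b for f, b in zip(forward_part, backward_part)]
print(summed)
```

The encoder does the same thing with tensor slicing over the last dimension, for all time steps and sentences at once.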

## Encoder

In [241]:
class EnconderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        """
        hidden_size : size of the hidden layer, i.e. the number of neurons in the hidden layer
        embedding : responsible for converting each word index to a dense vector
        """
        super(EnconderRNN, self).__init__() # Explains the super function: https://realpython.com/python-super/
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding
        # initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        # because our input size is a word embedding with number of features == hidden_size
        # documentation of GRU: https://pytorch.org/docs/stable/nn.html?highlight=gru#torch.nn.GRU
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout),bidirectional=True)
        
    def forward(self, input_seq, input_lenghts, hidden=None): # forward propagation
        # input_seq: batch of input sentences; shape=(max_length, batch_size)
        # input_lenghts: list of sentence lengths corresponding to each sentence in the batch
        # hidden state of shape: (n_layers x num_directions, batch_size, hidden_size)
        # num_directions = 2 because the GRU is bidirectional
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = torch.nn.utils.rnn.pack_padded_sequence(embedded, input_lenghts)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:,:, self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
        # outputs: the output features h_t from the last layer of the GRU, for each timestep (sum of bidirectional outputs)
        # outputs shape=(max_length, batch_size, hidden_size)
        # hidden: hidden state for the last timestep, of shape=(n_layers x num_directions, batch_size, hidden_size)
        

In [243]:
a = torch.randn(5,4,3) # 5 channels, 4 rows, 3 columns
a                      # we have 4 rows and 3 columns in every channel

tensor([[[-0.0361,  1.2720, -0.4121],
         [ 1.3197, -0.3087, -0.5791],
         [-0.8038,  0.6398, -0.3175],
         [ 2.0555, -0.4535,  0.0696]],

        [[ 0.7546,  1.0242, -1.1892],
         [ 0.3190, -1.2766,  0.6723],
         [ 0.0298, -2.4920,  0.2441],
         [ 0.3087,  0.8232,  0.3213]],

        [[ 0.8855, -0.5930,  0.3127],
         [-0.6480, -0.5067,  0.1206],
         [-0.9421, -1.2834, -0.5306],
         [-0.1097,  2.2399, -0.2256]],

        [[ 0.1231, -0.0672,  1.6914],
         [-1.0091, -0.0132,  1.9978],
         [-0.4985, -1.7161,  1.6728],
         [ 0.5662, -1.4001,  0.7810]],

        [[-0.4589,  1.5913,  0.5146],
         [ 0.8215, -0.5146,  0.7680],
         [ 0.0756, -0.4400, -0.5437],
         [-0.4954,  0.5183, -0.3796]]])

In [244]:
a[:,:,:2] # we want to see all channels, all rows, and columns from 0 until 2 exclusive

tensor([[[-0.0361,  1.2720],
         [ 1.3197, -0.3087],
         [-0.8038,  0.6398],
         [ 2.0555, -0.4535]],

        [[ 0.7546,  1.0242],
         [ 0.3190, -1.2766],
         [ 0.0298, -2.4920],
         [ 0.3087,  0.8232]],

        [[ 0.8855, -0.5930],
         [-0.6480, -0.5067],
         [-0.9421, -1.2834],
         [-0.1097,  2.2399]],

        [[ 0.1231, -0.0672],
         [-1.0091, -0.0132],
         [-0.4985, -1.7161],
         [ 0.5662, -1.4001]],

        [[-0.4589,  1.5913],
         [ 0.8215, -0.5146],
         [ 0.0756, -0.4400],
         [-0.4954,  0.5183]]])

### Understanding pack_padded_sequence
![](pad_padded_sequence.png)
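
As a small illustration of what packing does, the sketch below packs a toy batch of two sequences and unpacks it again. The values are made-up stand-ins for embedded words, and 0 is assumed to mark padding.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences padded to max_length=3, laid out as (max_length, batch_size).
# Sequence 0 is [1, 2, 3]; sequence 1 is [4, 5] plus one padding value (0).
padded = torch.tensor([[1., 4.],
                       [2., 5.],
                       [3., 0.]])
lengths = torch.tensor([3, 2])   # true lengths, sorted in decreasing order

packed = pack_padded_sequence(padded, lengths)
print(packed.data)          # tensor([1., 4., 2., 5., 3.]) -> the padding value is gone
print(packed.batch_sizes)   # tensor([2, 2, 1]) -> active sequences at each time step

# pad_packed_sequence restores the padded layout and the lengths
unpacked, unpacked_lengths = pad_packed_sequence(packed)
print(unpacked_lengths)     # tensor([3, 2])
```

The GRU then only processes the real time steps of each sequence, so the padding does not influence the hidden states.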


## Decoder
The decoder RNN generates the response sentence token by token. It uses the encoder's context vectors and its own internal hidden states to generate the next word in the sequence. We will use an "attention mechanism" that allows the decoder to pay attention to certain parts of the input sequence, rather than using the entire fixed context at every step. At a high level, attention is calculated using the decoder's current hidden state and the encoder's outputs. We will also use "Global attention", where we consider all of the encoder's hidden states, as opposed to "Local attention", which only considers the encoder's hidden state from the current time step. With global attention, we also calculate the attention weights using the hidden state of the decoder from the current time step only. The output of this attention module is a softmax-normalized weights tensor of shape (batch_size, 1, max_length).

In Global Attention, there are various methods, called "score functions", to calculate the attention energies between the encoder outputs and the decoder output. In the figure below, _ht_ is the current target decoder state and _hs_bar_ is the set of all encoder states.

![](attention_opperation.png)
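
For reference, the score functions from Luong et al. can be written out as follows, with $h_t$ the current decoder state and $\bar{h}_s$ an encoder state. The weight names $W_a$ and $v_a$ follow that paper and do not appear elsewhere in this notebook, which implements only the dot variant.

$$
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
\end{cases}
$$

The attention weights are then obtained by normalizing the scores over all source positions:

$$
a_t(s) = \frac{\exp\left(\mathrm{score}(h_t, \bar{h}_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, \bar{h}_{s'})\right)}
$$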


## Attention Mechanism Diagram

![](attention_mech2.png)

In [275]:
#Luong attention layer
class Attn(torch.nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        self.hidden_size = hidden_size
        
    def dot_score(self, hidden, encoder_output):
        # Element-Wise Multiply the current target decoder state with the encoder output and sum them
        return torch.sum(hidden * encoder_output, dim=2)
    
    def forward(self, hidden, encoder_outputs):
        """
        It is the forward propagation.
        It is how the Attention is calculated
        """
        
        # hidden of shape: (1, batch_size, hidden_size) -> decoder output for Attention Mechanism
        # 1 because we are calculating one GRU per time step
        # encoder_outputs of shape: (max_length, batch_size, hidden_size)
        # (1, batch_size, hidden_size) * (max_length, batch_size, hidden_size) = (max_length, batch_size, hidden_size)
        
        # Calculate the attention weights (energies)
        attn_energies = self.dot_score(hidden, encoder_outputs)      # (max_length, batch_size)
        # Transpose max_length and batch-size dimensions
        attn_energies = attn_energies.t()                            # (batch_size, max_length)
        # return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)          # (batch_size, 1, max_length)  
    # unsqueeze(1) adds a dimension at position 1: (batch_size, "1", max_length)
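
The shape bookkeeping in the comments above can be checked with random tensors. This is only a sketch of the dot score with illustrative sizes; it repeats the same operations outside the class so that each intermediate shape can be printed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_length, batch_size, hidden_size = 5, 3, 7   # illustrative sizes

hidden = torch.randn(1, batch_size, hidden_size)                     # one decoder time step
encoder_outputs = torch.randn(max_length, batch_size, hidden_size)

# Broadcasting stretches the leading 1 of `hidden` to max_length before the sum
energies = torch.sum(hidden * encoder_outputs, dim=2)                # (max_length, batch_size)
weights = F.softmax(energies.t(), dim=1).unsqueeze(1)                # (batch_size, 1, max_length)

print(energies.shape)       # torch.Size([5, 3])
print(weights.shape)        # torch.Size([3, 1, 5])
print(weights.sum(dim=2))   # every entry is 1 (up to rounding): one distribution per sentence
```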

In [270]:
# Demonstrates sum(dim=2), and helps to see why softmax(dim=1) is computed over every row.
# max_length: 5 (channels); batch size: 3 (rows); hidden size: 7 (columns)
import torch
a = torch.randn(5,3,7)
a 

tensor([[[ 0.0657, -0.4565,  0.7132, -3.2464,  0.4982,  0.6139,  1.4410],
         [ 0.2646, -0.3682, -0.1334, -0.2555,  1.4373, -0.8832, -0.6729],
         [-0.4064, -0.6610, -0.6169, -1.1091,  0.4788, -0.3371, -1.5309]],

        [[ 0.9737,  0.9701,  0.9699, -0.7398, -0.0400, -0.6627, -0.1215],
         [-0.4349,  0.5204, -0.2218, -0.3981,  0.3841,  0.2765, -0.2465],
         [-0.1210, -1.1098,  0.5458, -0.1704,  0.6452,  1.1778, -0.8744]],

        [[-1.1025,  0.6106, -0.1763,  2.0263,  1.2617, -0.8445, -0.2611],
         [-0.3142,  0.4670,  0.0525, -0.4820, -0.9280, -0.8001, -0.4884],
         [-0.6144, -0.8240, -1.0479,  0.9472,  1.6395, -0.7119, -0.0059]],

        [[-1.8609, -0.2115,  1.3001, -1.2104, -2.1163, -0.1173, -0.2474],
         [ 1.2892,  0.2785, -0.1880,  0.1181, -0.1948, -1.5507,  0.4915],
         [ 0.5297, -0.1946,  0.6487,  0.8425, -0.4199,  1.5457,  1.7494]],

        [[ 1.0840, -1.8077, -0.4910,  0.0595, -1.0137,  1.0619,  0.4976],
         [ 0.4730, -1.2591,  0

In [274]:
print(torch.sum(a,dim=2)) # sum across the columns

# first row
print(sum([ 0.0657, -0.4565,  0.7132, -3.2464,  0.4982,  0.6139,  1.4410]))

tensor([[-0.3709, -0.6113, -4.1825],
        [ 1.3498, -0.1202,  0.0931],
        [ 1.5141, -2.4933, -0.6174],
        [-4.4637,  0.2438,  4.7016],
        [-0.6094,  0.2185, -0.4035]])
-0.37089999999999956


In [276]:
# demonstrate the softmax
import torch.nn.functional as F
a = torch.rand(5,7)
a

tensor([[0.1661, 0.1634, 0.5879, 0.3865, 0.6011, 0.7049, 0.4339],
        [0.4073, 0.6695, 0.6044, 0.2131, 0.1616, 0.5841, 0.0606],
        [0.7633, 0.0741, 0.8596, 0.1466, 0.6943, 0.8592, 0.0101],
        [0.8591, 0.9827, 0.2263, 0.6077, 0.3489, 0.3362, 0.2988],
        [0.5157, 0.5268, 0.9734, 0.8952, 0.5876, 0.0105, 0.0759]])

In [280]:
b = F.softmax(a, dim=1)

In [281]:
b[0].sum()

tensor(1.)

In [297]:
for i in range(0,5):
    print(b[i].sum())

tensor(1.)
tensor(1.)
tensor(1.0000)
tensor(1.0000)
tensor(1.)


In [317]:
new_list = list(range(0,5))
list(map(lambda x: print(b[x].sum()),new_list))

tensor(1.)
tensor(1.)
tensor(1.0000)
tensor(1.0000)
tensor(1.)


[None, None, None, None, None]

## Designing the Decoder

Now we use our attention layer to implement the decoder.

For the decoder, we will manually feed our batch one time step at a time. This means that our embedded word tensor and GRU output will both have shape (1, batch_size, hidden_size). The steps are:

       1. Get embedding of current input word.
       2. Forward through unidirectional GRU.
       3. Calculate attention weights from the current GRU output from (2).
       4. Multiply attention weights to encoder outputs to get new “weighted sum” context vector.
       5. Concatenate weighted context vector and GRU output using Luong eq. 5.
       6. Predict next word using Luong eq. 6 (without softmax).
       7. Return output and final hidden state.



- The GRU output has shape (seq_len, batch_size, num_directions * hidden_size); see the PyTorch documentation of nn.GRU for details.
- seq_len = 1 because we run the GRU one time step at a time.
- num_directions = 1 in the decoder because its GRU is unidirectional (it is 2 only in the bidirectional encoder).
- h_n is the hidden state after the last time step. We are not feeding a whole sequence through the GRU; we feed one time step at a time, so the hidden state returned by the GRU is the state after that single step.
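
Step 4 of the list above (the "weighted sum" context vector) can be sketched with random tensors. The sizes are illustrative assumptions, and the attention weights are random rather than produced by the Attn layer.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_length, batch_size, hidden_size = 4, 2, 3   # illustrative sizes

encoder_outputs = torch.randn(max_length, batch_size, hidden_size)
# Random attention weights; each row sums to 1 over the max_length positions
attn_weights = F.softmax(torch.randn(batch_size, 1, max_length), dim=2)

# (batch_size, 1, max_length) bmm (batch_size, max_length, hidden_size) -> (batch_size, 1, hidden_size)
context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
print(context.shape)   # torch.Size([2, 1, 3])

# The same weighted sum written by hand for the first sentence in the batch
manual = (attn_weights[0, 0].unsqueeze(1) * encoder_outputs[:, 0, :]).sum(dim=0)
print(torch.allclose(context[0, 0], manual, atol=1e-6))   # True
```

Each context vector is therefore a mix of the encoder outputs, weighted by how much attention the decoder pays to each input position.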

In [324]:
class LuongAttndecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttndecoderRNN, self).__init__()
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        
        """
        Embeddings: the input of shape (1, batch_size) holds only the indexes of the words,
        so we need to transform them into dense embedding vectors that represent the features of the words.
        We need to do this every time we feed words to the network.
        """
        
        # Define layers: dropout reduces overfitting
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        
        self.attn = Attn(attn_model, hidden_size) # initialize the attention layer inside the class
    
    def forward(self, input_step, last_hidden, encoder_outputs):
        # input_step: one time step (one word) of input sequence batch; shape=(1, batch_size)
        # last_hidden: hidden state from the previous step (initially taken from the encoder's final hidden state);
        #              shape=(n_layers x num_directions, batch_size, hidden_size)
        # encoder_outputs: encoder model's output (bidirectional outputs already summed);
        #                  shape=(max_length, batch_size, hidden_size)
        # Note: we run this one step (batch of words) at a time
        
        # Get embedding of current input word
        embedded = self.embedding(input_step)              # represents a word as features
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden) # current input step and the previous hidden state
        # rnn_output of shape (seq_len, batch, num_directions * hidden_size)
        # hidden of shape (num_layers * num_directions, batch, hidden_size)
        
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        # Multiply attention weights by encoder outputs to get new "weighted sum" context vector
        # (batch_size, 1, max_length) bmm with (batch_size, max_length, hidden) = (batch_size, 1, hidden)
        # context vector: shape (batch_size, 1, hidden)
        """ bmm -> batch matrix multiplication, used to multiply 3D tensors
        As shown in the diagram:
            - encoder output: shape (max_length, batch_size, hidden_size)
            - attention output: shape (batch_size, 1, max_length)
        So we multiply attn_weights with the transposed encoder_outputs using bmm
        """ 
        context = attn_weights.bmm(encoder_outputs.transpose(0,1))
        
        # Concatenate weighted context vector and GRU output
        rnn_output = rnn_output.squeeze(0) # drop the first dimension of the GRU output:
                                           # from (1, batch_size, hidden_size) to (batch_size, hidden_size)
            
        context = context.squeeze(1)       # drop the second dimension of the context vector:
                                           # from (batch_size, 1, hidden_size) to (batch_size, hidden_size)
            
        concat_input = torch.cat((rnn_output, context), 1)         # concatenate the two along the columns
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)  # the columns represent a distribution over the vocabulary words
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden
        # output: softmax-normalized tensor giving probabilities of each word being the correct next word in
        #         the decoded sequence
        # shape = (batch_size, voc.num_words)
        # hidden: final hidden state of GRU; shape=(n_layers x num_directions, batch_size, hidden_size)
        

### We're done with building the Architecture. Now Let's move on to the Training code

## Creating the Loss Function

Since we are dealing with batches of padded sequences, we cannot simply consider all elements of the tensor when calculating loss. We define maskNLLLoss to calculate our loss based on our decoder's output tensor, the target tensor, and a binary mask tensor describing the padding of the target tensor. This loss function calculates the average negative log likelihood of the elements that correspond to a 1 in the mask tensor.


In [325]:
def maskNLLLoss(decoder_out, target, mask):  
    
    """
    NLLLoss: Negative Log-Likelihood Loss
    The mask is returned by the 'batch2TrainData' function
    """
    
    nTotal = mask.sum()        # how many elements we should consider
    target = target.view(-1,1) # reshape to a column because torch.gather expects an index with one entry per row;
                               # -1 lets PyTorch infer that dimension (here batch_size), the second dimension is 1
        
    # decoder_out shape: (batch_size, vocab_size), target shape: (batch_size, 1)
    gathered_tensor = torch.gather(decoder_out, 1, target)
    # calculate the negative log likelihood; squeeze to (batch_size,) so it lines up with the mask
    # (assumes the mask passed in has shape (batch_size,); without the squeeze the mask would broadcast)
    crossEntropy = -torch.log(gathered_tensor.squeeze(1))
    # select the elements where the mask is non-zero
    loss = crossEntropy.masked_select(mask) # according to the mask, it takes
                                            # the corresponding elements of the crossEntropy tensor
    # calculate the mean of the loss
    loss = loss.mean()
    loss = loss.to(device)                  # move the loss to the selected device (GPU if available, otherwise CPU)
    return loss, nTotal.item()

In [328]:
# Visualize what's happening in the loss function
# decoder_out shape: (batch_size, vocab_size), target_size = (batch_size,1)
dec_o = torch.rand(5,7)
dec_o = F.softmax(dec_o, dim=1)
tar = torch.tensor([2, 1, 5, 4, 0], dtype = torch.long)
tar = tar.view(-1,1)
mask = torch.tensor([1, 0, 1, 1, 0], dtype = torch.bool)  # boolean mask; uint8 masks are deprecated for masked_select
print(dec_o)
print(tar)
gath_ten = torch.gather(dec_o, 1 ,tar) # Get the softmax scores for the expected correct predictions
print(gath_ten)
print(gath_ten.shape)
crossEntropy= - torch.log(gath_ten)
print("Cross Entropy:")
print(crossEntropy)
mask = mask.unsqueeze(1)
loss = crossEntropy.masked_select(mask)
print("Loss:")
print(loss)
print(loss.shape)
print("Sum of mask elements (How many elements we are considering): ", mask.sum())
print("Mean of the Loss: ", loss.mean())
print("Mean of the cross-entropy loss (without masking):", crossEntropy.mean())

tensor([[0.2025, 0.0812, 0.0970, 0.1974, 0.1881, 0.1015, 0.1323],
        [0.1270, 0.1406, 0.1583, 0.1530, 0.1766, 0.1612, 0.0832],
        [0.1120, 0.1860, 0.1218, 0.1724, 0.1263, 0.1447, 0.1367],
        [0.1261, 0.1621, 0.1073, 0.1976, 0.1679, 0.1646, 0.0743],
        [0.1804, 0.1125, 0.1055, 0.1987, 0.1898, 0.1127, 0.1004]])
tensor([[2],
        [1],
        [5],
        [4],
        [0]])
tensor([[0.0970],
        [0.1406],
        [0.1447],
        [0.1679],
        [0.1804]])
torch.Size([5, 1])
Cross Entropy:
tensor([[2.3332],
        [1.9622],
        [1.9332],
        [1.7845],
        [1.7124]])
Loss:
tensor([2.3332, 1.9332, 1.7845])
torch.Size([3])
Sum of mask elements (How many elements we are considering):  tensor(3)
Mean of the Loss:  tensor(2.0170)
Mean of the cross-entropy loss (without masking): tensor(1.9451)


## Teacher Forcing

![](teach_force.png)


We will use **Teacher Forcing** in training. This means that with some probability, set by teacher_forcing_ratio, we use the current target word as the decoder's next input rather than the decoder's current guess. This technique acts as training wheels for the decoder, aiding more efficient training. However, teacher forcing can lead to model instability during inference, as the decoder may not have had a sufficient chance to truly craft its own output sequences during training. Thus, we must be mindful of how we set the teacher_forcing_ratio, and not be fooled by fast convergence. The second trick that we implement is **Gradient Clipping**. This is a commonly used technique for countering the "exploding gradient" problem. In essence, by clipping or thresholding gradients to a maximum value, we prevent the gradients from growing exponentially and either overflowing (NaN) or overshooting steep cliffs in the cost function.
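
As a rough illustration of gradient clipping, the sketch below uses torch.nn.utils.clip_grad_norm_ on a tiny stand-in model. The model, the deliberately inflated loss, and the threshold of 50.0 are assumptions made for the demonstration, not the notebook's training setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 1)   # tiny stand-in for the encoder/decoder

# Deliberately large loss so that the gradients are large as well
loss = (model(torch.randn(8, 3)) * 1000).pow(2).mean()
loss.backward()

clip = 50.0   # hypothetical clipping threshold
# Rescales the gradients in place; returns the total norm measured BEFORE clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
# Recompute the total gradient norm after clipping
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print("gradient norm before clipping:", float(norm_before))
print("gradient norm after clipping: ", float(norm_after))   # at most `clip`, up to rounding
```

In a training loop, this call sits between loss.backward() and the optimizer step, once for each model whose gradients should be clipped.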

**Sequence of Operations in Training:**

        1. Forward pass entire input batch through encoder.
        2. Initialize decoder inputs as SOS_token, and hidden state as the encoder’s final hidden state.
        3. Forward input batch sequence through decoder one time step at a time.
        4. If teacher forcing: set next decoder input as the current target; 
             else: set next decoder input as current decoder output.
        5. Calculate and accumulate loss.
        6. Perform backpropagation.
        7. Clip gradients.
        8. Update encoder and decoder model parameters.
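
Step 4 of the sequence above can be sketched as follows. The decoder output and target indexes are made up, and teacher_forcing_ratio = 0.5 is a hypothetical value; the point is only how the next decoder input is chosen and what shape it has.

```python
import random
import torch

random.seed(0)
torch.manual_seed(0)

batch_size, vocab_size = 3, 6    # illustrative sizes
teacher_forcing_ratio = 0.5      # hypothetical value

# Made-up decoder output for one time step: one distribution over the vocabulary per sentence
decoder_output = torch.softmax(torch.randn(batch_size, vocab_size), dim=1)
target_t = torch.tensor([4, 1, 2])   # made-up target word indexes at this time step

# Draw once to decide whether to use teacher forcing
use_teacher_forcing = random.random() < teacher_forcing_ratio

if use_teacher_forcing:
    # Next input is the ground-truth word
    decoder_input = target_t.view(1, -1)
else:
    # Next input is the decoder's own most likely word
    _, topi = decoder_output.topk(1)      # topi has shape (batch_size, 1)
    decoder_input = topi.view(1, -1)

print(decoder_input.shape)   # torch.Size([1, 3]) -> (1, batch_size), as the decoder expects
```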
        
Before moving on to training, let's walk through one training iteration and see what happens to the data:

In [None]:
# Visualizing what's happening in one iteration. Only run this for visualization.
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lengts, target_variable, mask, max_target_len = batches

print("input_variable shape:", input_variable.shape)
print("lengts shape:", lengts.shape)
print("target_variable shape:", target_variable.shape)
print("mask shape:", mask.shape)
print("max_target_len:", max_target_len)  # assumed to be a plain int (as in the original tutorial), which has no .shape

# define the parameters
hidden_size= 500
enconder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
attn_model = 'dot'
embedding = nn.Embedding(voc.num_words, hidden_size)

# Define the encoder and decoder
enconder = EnconderRNN(hidden_size, embedding, enconder_n_layers, dropout)
decoder = LuongAttndecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
enconder = enconder.to(device)

    
