## Building a Chatbot: PyTorch

Note: This code has been adopted from https://pytorch.org/tutorials/beginner/chatbot_tutorial.html. We will demonstrate more details and observe the output setp-by-step to have a deeper understanding.

We can follow the explaination in the video: https://www.youtube.com/watch?v=CNuI8OWsppg

In [1]:
import torch
import torch.nn as nn                   # neureal-network package
from torch import optim                 # optimazers
import torch.nn.functional as F         # functions like ReLu and another functionalities
import csv
import random
import re                               # regular expression library
import os
import unicodedata
import codecs
import itertools

In [2]:
CUDA = torch.cuda.is_available()   # True or False
device = torch.device("cuda" if CUDA else "cpu")

## Part 1: DataPreprocessing

Download data from: https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html

In [76]:
lines_filepath = os.path.join("cornell movie-dialogs corpus","movie_lines.txt")   #("folder_name", "filename.txt")
conv_filepath = os.path.join("cornell movie-dialogs corpus","movie_conversations.txt")
save_dir = os.path.join("cornell movie-dialogs corpus", "save")

In [4]:
#Visualize some lines
with open(lines_filepath, 'r') as file:
    lines = file.readlines()
for line in lines[:8]:    # loop for first 8 lines
    print(line.strip())   # print the frist 8 lines

# LineID - CharacterID - MovieID - CharacterName - Speach of Character

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow
L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.
L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No


## Processing DataSet - Part 1

In [5]:
# Splits each line of the file into a dictionary of fields (lineID, characterID, movieID, character, text)
line_fields = ["lineID", "characterID", "movieID", "character", "text"]
lines = {}  # empty dictionary
with open(lines_filepath, 'r', encoding = 'iso-8859-1') as f:
    for line in f:
        values = line.split(" +++$+++ ")
        # Extract fields
        lineObj = {}   #temporary dictionary that resets at each loop (each line of the text)
        for i, field in enumerate(line_fields):
            lineObj[field] = values[i]
        lines[lineObj['lineID']] = lineObj
        
# Each line of lines we have a dictionary with the data for each LineID
lines

{'L1045': {'lineID': 'L1045',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'They do not!\n'},
 'L1044': {'lineID': 'L1044',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'They do to!\n'},
 'L985': {'lineID': 'L985',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'I hope so.\n'},
 'L984': {'lineID': 'L984',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'She okay?\n'},
 'L925': {'lineID': 'L925',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': "Let's go.\n"},
 'L924': {'lineID': 'L924',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'Wow\n'},
 'L872': {'lineID': 'L872',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': "Okay -- you're gonna need to learn how to lie.\n"},
 'L871': {'lineID': 'L871',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': 'No

In [6]:
# Illustrate the enumerate
line_fields = ["lineID", "characterID", "movieID", "character", "text"]
for i, field in enumerate(line_fields):
    print(i,field)


0 lineID
1 characterID
2 movieID
3 character
4 text


## Processing DataSet - Part 2
### movie_conversations.txt


In [7]:
# groups fields of lines from 'LoadLines' into conversations based on "movie-conversations.txt"


# characterID1 - actor 1
# characterID2 - actor 2
# movieID - movie ID
# utteranceIDs - lines of dialogue between two actors

conv_fields = ["characterID1", "characterID2", " movieID", "utteranceIDs"]
conversations = []
with open(conv_filepath, 'r', encoding = "iso-8859-1") as f:
    for line in f:
        values = line.split(" +++$+++ ")
        # Extract fields
        convObj = {}   # empty dictionary
        for i, field in enumerate(conv_fields):
            convObj[field] = values[i]
        # Convert string result from split to list, since convObj["utteranceIDs"] == ['L598485', 'L598486', ...]
        lineIDs = eval(convObj["utteranceIDs"])
        
        #Reassemble lines
        convObj["lines"] = []
        for lineID in lineIDs:
            convObj["lines"].append(lines[lineID])
        conversations.append(convObj)
    

In [8]:
conversations[0]

{'characterID1': 'u0',
 'characterID2': 'u2',
 ' movieID': 'm0',
 'utteranceIDs': "['L194', 'L195', 'L196', 'L197']\n",
 'lines': [{'lineID': 'L194',
   'characterID': 'u0',
   'movieID': 'm0',
   'character': 'BIANCA',
   'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
  {'lineID': 'L195',
   'characterID': 'u2',
   'movieID': 'm0',
   'character': 'CAMERON',
   'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
  {'lineID': 'L196',
   'characterID': 'u0',
   'movieID': 'm0',
   'character': 'BIANCA',
   'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
  {'lineID': 'L197',
   'characterID': 'u2',
   'movieID': 'm0',
   'character': 'CAMERON',
   'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}]}

## Processing DataSet - Part 3

Put together the questions and answers, combine them, to process de enconder and decoder

We transform the dataset in form to questions and answers.

We take the question and the respective reply and store them in the list.

In [9]:
# Extract the data of the dictionary lines of the first element 
conversations[0]["lines"]

[{'lineID': 'L194',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'},
 {'lineID': 'L195',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': "Well, I thought we'd start with pronunciation, if that's okay with you.\n"},
 {'lineID': 'L196',
  'characterID': 'u0',
  'movieID': 'm0',
  'character': 'BIANCA',
  'text': 'Not the hacking and gagging and spitting part.  Please.\n'},
 {'lineID': 'L197',
  'characterID': 'u2',
  'movieID': 'm0',
  'character': 'CAMERON',
  'text': "Okay... then how 'bout we try out some French cuisine.  Saturday?  Night?\n"}]

In [10]:
# Check how many dialogues we have in the first element
len(conversations[0]["lines"])

4

In [11]:
# Extract the data of the first conversation in the first element
conversations[0]["lines"][0]

{'lineID': 'L194',
 'characterID': 'u0',
 'movieID': 'm0',
 'character': 'BIANCA',
 'text': 'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'}

In [12]:
# Extract the text of the first conversation in the first element
conversations[0]["lines"][0]["text"]

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\n'

In [13]:
conversations[0]["lines"][0]["text"].strip() # this eliminate the \n and we got the pure text

'Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.'

In [14]:
# Extract pairs of sentences from conversation
qa_pairs = []

for conversation in conversations:
    # Iterate over all the lines of conversation
    for i in range(len(conversation["lines"])-1):
        inputLine = conversation["lines"][i]["text"].strip()
        targetLine = conversation["lines"][i+1]["text"].strip()
        #Filter wrong samples (if one of the lists is empty)
        if inputLine and targetLine:
            qa_pairs.append([inputLine, targetLine])

In [15]:
qa_pairs[1]

["Well, I thought we'd start with pronunciation, if that's okay with you.",
 'Not the hacking and gagging and spitting part.  Please.']

## Processing DataSet - Part 4

Save the dialogue in a text file to not be processed many times

In [16]:
# Define path to new file
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
delimiter =  "\t" # the pair question-answer will be delimited by TAB
#Unescape the delimiter
delimiter = str(codecs.decode(delimiter, "unicode-escape"))

# Write new csv file
print("\nWriting newly formatted file...")
with open(datafile, 'w', encoding = "utf-8") as outputfile:
    writer = csv.writer(outputfile, delimiter = delimiter)
    for pair in qa_pairs:
        writer.writerow(pair)
print("Done writing to file")


Writing newly formatted file...
Done writing to file


In [17]:
# Visualize some lines
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
with open(datafile, 'rb') as file:
    lines = file.readlines()

for line in lines[:8]:
    print(line)

b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\r\r\n"
b'Why?\tU

## Processing the Words

In [18]:
PAD_token = 0 # Used for padding short sentences
SOS_token = 1 # Start-of-sentence token <START>
EOS_token = 2 # End-os-sentence token <END>

class Vocabulary:
    def __init__(self, name):
        self.name = name
        self.word2index = {}   # key-value pair (for word: car - 10, road - 34)
        self.word2count = {}   # counts the frequency of the words
        self.index2word = {PAD_token: "PAD", SOS_token:"SOS", EOS_token:"EOS"}
        self.num_words = 3 #Count PAD, SOS, EOS  #initializatoin f the numbre of the words 
    
    # Next we define methods
    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.num_words
            self.word2count[word] = 1
            self.index2word[self.num_words] = word
            self.num_words += 1
        else:
            self.word2count[word] += 1
            
    # Remove words below a certain cound threshold
    def trim(self, min_count):
        keep_words = []
        for k, v in self.word2count.items():   # loop for each key and values
            if v >= min_count:
                keep_words.append(k)
        
        print("keep_words {} / {} = {:.4f}".format(len(keep_words), 
                                                   len(self.word2index), 
                                                   len(keep_words) / len(self.word2index)))
        # reinitialize dictionaries
        self.word2index = {}   # key-value pair (for word: car - 10, road - 34)
        self.word2count = {}   # counts the frequency of the words
        self.index2word = {PAD_token: "PAD", SOS_token:"SOS", EOS_token:"EOS"}
        self.num_words = 3 #Count PAD, SOS, EOS  #initializatoin f the numbre of the words 
        
        # here we only keep the words that have value >= min_count
        for word in keep_words:
            self.addWord(word)

## Preprocessing: Remove punctuations and signs

In [19]:
# Turn a Unicode string to plain ASCII
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD',s) if unicodedata.category(c) != 'Mn')

# Mn - non mark space

In [20]:
# testing the function unicodeAscii
unicodeToAscii("Montréal, Françoise....")

'Montreal, Francoise....'

In [21]:
# Lower case, trim white spaces, lines, ...etc, and remove non-letter characters.
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    # replace any .!? by a whitespace + character --> '!' = ' ! ' \1 menas the first bracket group --> [.!?].
    # r is to not consider \1 as a character (r to escape a backslash). + means one or more
    # re = regular expression
    s = re.sub(r"([.!?])", r" \1", s) #substitute this characters to the character and space
    # remove any character that is not a sequence of lower or upper case letters
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    # remove a sequence if whitespace character
    s = re.sub(r"\s+", r" ", s).strip() #remove lenf and right spaces
    return s 

In [22]:
# testing the function
normalizeString("aa123aa!s's   dd?")

'aa aa !s s dd ?'

## Processing text

In [23]:
datafile = os.path.join("cornell movie-dialogs corpus", "formatted_movie_lines.txt")
# read the file and split into lines
print("Reading and processing file....Please Wait")
lines = open(datafile, 'r', encoding = "utf-8").read().strip().split('\n')
# Split every line into pais and normalize
pairs = [[normalizeString(s) for s in pair.split('\t')] for pair in lines]
print("Done reading !")
voc = Vocabulary("cornell movie-dialogs corpus")

Reading and processing file....Please Wait
Done reading !


Explanation of the 

    pairs = [[normalizeString(s) for s in pair.split('\t')] for pair in lines]

In [24]:
lines[0].split('\t')

['Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.',
 "Well, I thought we'd start with pronunciation, if that's okay with you."]

In [25]:
normalizeString(lines[0].split('\t')[0])

'can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .'

In [26]:
normalizeString(lines[0].split('\t')[1])

'well i thought we d start with pronunciation if that s okay with you .'

In [27]:
pairs[0]

['can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .',
 'well i thought we d start with pronunciation if that s okay with you .']

### Limit the length of sentences 

In [28]:
# Returns True if both sentences in a pair 'p' are under the MAX_LENGTH threshold
MAX_LENGTH = 10 # maximum sentence length to consider (max_words)
def filterPair(p):  #each pair is a list of two elements
    # Input sequences need to perserve the last word for EOS token
    return len(p[0].split()) < MAX_LENGTH and len(p[1].split()) < MAX_LENGTH

# len(pairs[0][0].split())
# len(pairs[0][1].split())

# Filter pairs using filterpairs condition
def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]

In [29]:
pairs

[['can we make this quick ? roxanne korrine and andrew barrett are having an incredibly horrendous public break up on the quad . again .',
  'well i thought we d start with pronunciation if that s okay with you .'],
 [''],
 ['well i thought we d start with pronunciation if that s okay with you .',
  'not the hacking and gagging and spitting part . please .'],
 [''],
 ['not the hacking and gagging and spitting part . please .',
  'okay . . . then how bout we try out some french cuisine . saturday ? night ?'],
 [''],
 ['you re asking me out . that s so cute . what s your name again ?',
  'forget it .'],
 [''],
 ['no no it s my fault we didn t have a proper introduction', 'cameron .'],
 [''],
 ['cameron .',
  'the thing is cameron i m at the mercy of a particularly hideous breed of loser . my sister . i can t date until she does .'],
 [''],
 ['the thing is cameron i m at the mercy of a particularly hideous breed of loser . my sister . i can t date until she does .',
  'seems like she coul

In [30]:
# Now we want take the pairs with have lenght bigger than one to skip the pairs ['']
pairs = [pair for pair in pairs if len(pair) > 1]
print("There are {} pairs/conversations in the dataset".format(len(pairs)))
pairs = filterPairs(pairs)
print("After filtering, there are {} pairs/conversations".format(len(pairs)))

There are 221282 pairs/conversations in the dataset
After filtering, there are 64271 pairs/conversations


## Getting Rid of Rare Words

In [31]:
# Loop through each pair of and add the question and reply sentence to the vocabulary
for pair in pairs:
    voc.addSentence(pair[0])
    voc.addSentence(pair[1])
print("Counted words:", voc.num_words)

Counted words: 18008


In [32]:
for pair in pairs[:10]:
    print(pair)

['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', 'the real you .']


### Trim rare words
If the frequency is below than 3 the word is skipped from dictionary

In [33]:
MIN_COUNT = 3  #Minimum word count threshold for trimming

def trimRareWords(voc, pairs, MIN_COUNT):
    # Trim words used under the MIN_COUNT from the voc
    voc.trim(MIN_COUNT)  # calling the trim function previous builded in the beginning of the code
    # Filter out pairs with trimmed words
    keep_pairs = []
    for pair in pairs:
        input_sentence = pair[0]
        output_sentence = pair[1]
        keep_input = True
        keep_output = True
        # Check input sequence
        for word in input_sentence.split(' '):
            if word not in voc.word2index:  #the word2index contains the filtered words
                keep_input = False
                break
        # Check input sequence
        for word in output_sentence.split(' '):
            if word not in voc.word2index:
                keep_output = False
                break
        
        # Only keep pairs that do not contain trimmed word(s) in their input or outpur sentence
        if keep_input and keep_output:
            keep_pairs.append(pair)
    
    print("Trimmed from {} pairs to {}, {:.4f} of total".format(len(pairs),
                                                               len(keep_pairs),
                                                               len(keep_pairs)/len(pairs)))
    return keep_pairs

In [34]:
# Trim voc and pairs
pairs = trimRareWords(voc,pairs,MIN_COUNT)

keep_words 7823 / 18005 = 0.4345
Trimmed from 64271 pairs to 53165, 0.8272 of total


# Prepare Data for Models

Although we have put a great deal of effort into preparing and massaging our data into a nice vocabulary object and list of sentence pairs, our models will ultimately **expect numerical torch tensors as inputs.** One way to prepare the processed data for the models can be found in the **[seq2seq translation tutorial](https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).** In that tutorial, we use a batch size of 1, meaning that all we have to do is convert the words in our sentence pairs to their corresponding indexes from the vocabulary and feed this to the models.

However, if you’re interested in speeding up training and/or would like to leverage **GPU parallelization** capabilities, you will need **to train with mini-batches.**

Using mini-batches also means that **we must be mindful of the variation of sentence length in our batches.** To accomodate sentences of different sizes in the same batch, we will make our batched input tensor of shape (max_length, batch_size), where **sentences shorter than the max_length are zero padded after an EOS_token.**

If we simply convert our English sentences to tensors by converting words to their indexes(indexesFromSentence) and zero-pad, our tensor would have shape (batch_size, max_length) and indexing the first dimension would return a full sequence across all time-steps. However, we need to be able to index our batch along time, and across all sequences in the batch. Therefore, **we transpose our input batch shape to (max_length, batch_size), so that indexing across the first dimension returns a time step across all sentences in the batch.** We handle this transpose implicitly in the 
zeroPadding function.

**Figure Below**

Each word is represented by an index. In other words, each number represents a different word. 

In the matrix the number of columns is the _max_length_ and the number of rows is the _batch_size_. The maximum length of the sentence was configured by us as 10. So the maximum number of columns will be 10. 

We transpose the matrix because we are processing it in batchs. So, each row will be passed in each timestamp, or in each LSTM. We take one row at the time and feed the timestamp of the LSTM.

![](gpu_batch.png)

In [35]:
#get the index of each sentences
def indexesFromSentence(voc, sentence):
    return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

In [36]:
pairs[1][0]

'you have my word . as a gentleman'

In [37]:
# test the function
indexesFromSentence(voc,pairs[1][0])

#Remember at the end we a index that represents the token word
# here the end punct(.) is identified as a index 2

[7, 8, 9, 10, 4, 11, 12, 13, 2]

In [38]:
#Define some samples for testing
inp = []
out = []
for pair in pairs[:10]:   # selecting the first 10 sentences
    inp.append(pair[0])
    out.append(pair[1])
    
print(inp)
print(len(inp))
indexes = [indexesFromSentence(voc,sentence) for sentence in inp]
indexes

['there .', 'you have my word . as a gentleman', 'hi .', 'have fun tonight ?', 'well no . . .', 'then that s all you had to say .', 'but', 'do you listen to this crap ?', 'what good stuff ?', 'wow']
10


[[3, 4, 2],
 [7, 8, 9, 10, 4, 11, 12, 13, 2],
 [16, 4, 2],
 [8, 31, 22, 6, 2],
 [33, 34, 4, 4, 4, 2],
 [35, 36, 37, 38, 7, 39, 40, 41, 4, 2],
 [42, 2],
 [47, 7, 48, 40, 45, 49, 6, 2],
 [50, 51, 52, 6, 2],
 [58, 2]]

## Understanding the Zip Function (zero padding)

We need make the number of columns consistent and order in descendent manner.

In [39]:
# Learning some extra functions that are helpfull in python

a = ['A','B','C']
b = [1, 2, 3]
list(zip(a,b))

[('A', 1), ('B', 2), ('C', 3)]

In [40]:
# If we have more elements in the "a" list than the "b" list the extra elements will be ingored. 
# But we have another function zip_longest that belongs to itertools method that allow us see these extra elements.
a = ['A','B','C','D','E']
b = [1, 2, 3]
list(itertools.zip_longest(a,b))

[('A', 1), ('B', 2), ('C', 3), ('D', None), ('E', None)]

In [41]:
# Now we will be use the zip_longest for prepare the matrix.
# See what happens when we zip the next list.
a = [[3, 4, 2],
 [7, 8, 9, 10, 4, 11, 12, 13, 2],
 [16, 4, 2],
 [8, 31, 22, 6, 2],
 [33, 34, 4, 4, 4, 2],
 [35, 36, 37, 38, 7, 39, 40, 41, 4, 2],
 [42, 2],
 [47, 7, 48, 40, 45, 49, 6, 2],
 [50, 51, 52, 6, 2],
 [58, 2]]

print(list(itertools.zip_longest(*a)))

# We see when we don't have elements to zip it it fills with None. 
# And wee see the fist element of this list is the first columun of the input list.
# And we see the most longest list is the first list that appears in the nested list. So, it order as we want.

print('\nfirst element of the list')
list(itertools.zip_longest(*a))[0]

[(3, 7, 16, 8, 33, 35, 42, 47, 50, 58), (4, 8, 4, 31, 34, 36, 2, 7, 51, 2), (2, 9, 2, 22, 4, 37, None, 48, 52, None), (None, 10, None, 6, 4, 38, None, 40, 6, None), (None, 4, None, 2, 4, 7, None, 45, 2, None), (None, 11, None, None, 2, 39, None, 49, None, None), (None, 12, None, None, None, 40, None, 6, None, None), (None, 13, None, None, None, 41, None, 2, None, None), (None, 2, None, None, None, 4, None, None, None, None), (None, None, None, None, None, 2, None, None, None, None)]

first element of the list


(3, 7, 16, 8, 33, 35, 42, 47, 50, 58)

In [42]:
# If we want fill the None value by zero it easy:
list(itertools.zip_longest(*a, fillvalue = 0))

[(3, 7, 16, 8, 33, 35, 42, 47, 50, 58),
 (4, 8, 4, 31, 34, 36, 2, 7, 51, 2),
 (2, 9, 2, 22, 4, 37, 0, 48, 52, 0),
 (0, 10, 0, 6, 4, 38, 0, 40, 6, 0),
 (0, 4, 0, 2, 4, 7, 0, 45, 2, 0),
 (0, 11, 0, 0, 2, 39, 0, 49, 0, 0),
 (0, 12, 0, 0, 0, 40, 0, 6, 0, 0),
 (0, 13, 0, 0, 0, 41, 0, 2, 0, 0),
 (0, 2, 0, 0, 0, 4, 0, 0, 0, 0),
 (0, 0, 0, 0, 0, 2, 0, 0, 0, 0)]

In [43]:
# Now we know how thwe zip function works and we are able to proceed:
def zeropadding(l,fillvalue = 0):
    return list(itertools.zip_longest(*l, fillvalue = fillvalue))

In [44]:
#test the function
test_result = zeropadding(indexes)
print('maximum length: ',len(test_result))
test_result

maximum length:  10


[(3, 7, 16, 8, 33, 35, 42, 47, 50, 58),
 (4, 8, 4, 31, 34, 36, 2, 7, 51, 2),
 (2, 9, 2, 22, 4, 37, 0, 48, 52, 0),
 (0, 10, 0, 6, 4, 38, 0, 40, 6, 0),
 (0, 4, 0, 2, 4, 7, 0, 45, 2, 0),
 (0, 11, 0, 0, 2, 39, 0, 49, 0, 0),
 (0, 12, 0, 0, 0, 40, 0, 6, 0, 0),
 (0, 13, 0, 0, 0, 41, 0, 2, 0, 0),
 (0, 2, 0, 0, 0, 4, 0, 0, 0, 0),
 (0, 0, 0, 0, 0, 2, 0, 0, 0, 0)]

In [45]:
leng = [len(ind) for ind in indexes]
max(leng)  # only to cehck if the maxium lengh is 10

10

In [46]:
leng

[3, 9, 3, 5, 6, 10, 2, 8, 5, 2]

In [47]:
# Converts our index tensor in a binary tensor (0's  and 1's)
def binaryMatrix(l, value=0):
    m = []
    for i, seq in enumerate(l):
        m.append([])
        for token in seq:
            if token == PAD_token:
                m[i].append(0)
            else:
                m[i].append(1)
    return m

In [48]:
binary_result = binaryMatrix(test_result)
binary_result

[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 [1, 1, 1, 1, 1, 1, 0, 1, 1, 0],
 [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
 [0, 1, 0, 1, 1, 1, 0, 1, 1, 0],
 [0, 1, 0, 0, 1, 1, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 1, 0, 0],
 [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]

In [49]:
#returns padded input sequence tensor and as well as a tensor of lenghts for each of the sequences in the batch
def inputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc,sentence) for sentence in l]
    lenghts = torch.tensor([len(indexes) for indexes in indexes_batch])
    padList = zeropadding(indexes_batch)
    padVar = torch.LongTensor(padList)
    return padVar, lenghts

In [50]:
# Returns padded target sequence tensor, padding mask, and target length
def outputVar(l, voc):
    indexes_batch = [indexesFromSentence(voc,sentence) for sentence in l]
    max_target_len = max([len(indexes) for indexes in indexes_batch])
    padList = zeropadding(indexes_batch)
    mask = binaryMatrix(padList)
    mask = torch.ByteTensor(mask)
    padVar = torch.LongTensor(padList)
    return padVar, mask, max_target_len

In [51]:
# Returns all items for a given batch pairs
def batch2TrainData(voc, pair_batch):
    #Sort the questions in descending length
    # We take the quention [0] split it by " " and return the length of the sentence
    # It will sort by the key which is the lenght of the sentence
    pair_batch.sort(key=lambda x: len(x[0].split(" ")), reverse = True)
    input_batch, output_batch = [], []
    for pair in pair_batch:
        input_batch.append(pair[0])
        output_batch.append(pair[1])
    inp, lenghts = inputVar(input_batch, voc)
    # assert len(inp) == lenghts[0]
    output, mask, max_target_len = outputVar(output_batch, voc)
    return inp, lenghts, output, mask, max_target_len

In [52]:
#Understanding the lambda function
def add(x):
    return x + 1
print(add(2))

add_l = lambda x: x + 1 #after semi colunms (:) we have the return of the lambda function
print(add_l(2))

3
3


In [53]:
# Test the Function
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])
input_variable, lenghts, target_variable, mask, max_target_len = batches

print("input_variable:")
print(input_variable)
print("lenghts: ", lenghts)
print("target_variable: ")
print(target_variable)
print("mask:")
print(mask)
print("max_target_len:", max_target_len)

input_variable:
tensor([[  50,   68,   70,    7,    7],
        [  37,    7,  292,  197,   14],
        [  61,  723, 3614,  117,   12],
        [ 534,   53,  203, 6195,  107],
        [  40,  140,   29,   70,  164],
        [  47,   53,  977, 7126, 4692],
        [ 169, 4410,  702,  358,    4],
        [  76,    6,    4,    4,    2],
        [   6,    2,    2,    2,    0],
        [   2,    0,    0,    0,    0]])
lenghts:  tensor([10,  9,  9,  9,  8])
target_variable: 
tensor([[  61,  167,   25,   34,   64],
        [  37,   11,  197,    4,  619],
        [ 123, 1054,  117,    2,    7],
        [ 169,   11,  118,    0,  112],
        [  83,   27,  253,    0,  215],
        [   4,  534,  217,    0,    2],
        [   2,  479,    4,    0,    0],
        [   0,    4,    2,    0,    0],
        [   0,    2,    0,    0,    0]])
mask:
tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 0, 1],
        [1, 1, 1, 0, 1],
        [1, 1, 1, 0, 1],
        [

# Defining Models

## seq2seq Model

The brains of our chatbot is a sequence-to-sequence (seq2seq) model. The goal of a seq2seq model is to take a variable-length sequence as an input, and return a variable-length sequence as an output using a fixed-sized model.

Sutskever et al. discovered that by **using two separate recurrent neural nets together**, we can accomplish this task. **One RNN acts as an encoder**, which encodes a variable length input sequence to a fixed-length context vector. In theory, this context vector (the final hidden layer of the RNN) will contain semantic information about the query sentence that is input to the bot. **The second RNN is a decoder**, which takes an input word and the context vector, and returns a guess for the next word in the sequence and a hidden state to use in the next iteration.

![](enc_dec.png)

The setps to be taken are:

        1. Convert word indexes to embeddings.
        2. Pack padded batch of sequences for RNN module.
        3. Forward pass through GRU.
        4. Unpack padding.
        5. Sum bidirectional GRU outputs.
        6. Return output and final hidden state.

We will use a bidirectional variant of the GRU, meaning that there are essentially two independent RNNs: one that is fed the input sequence in normal sequential order, and one that is fed the input sequence in reverse order. The outputs of each network are summed at each time step. Using a bidirectional GRU will give us the advantage of encoding both past and future context.

## Encoder

In [54]:
#hidden_size : it is the size of the hidden layer, or the nuber of the neurons that we have in the hidden layer
#embedding : responsable to convert the index to a dense vector value
class EncoderRNN(nn.Module):
    def __init__(self, hidden_size, embedding, n_layers=1, dropout=0):
        super(EncoderRNN, self).__init__()  # Explains the super function: https://realpython.com/python-super/
        self.n_layers = n_layers
        self.hidden_size = hidden_size
        self.embedding = embedding

        # documentation of GRU: https://pytorch.org/docs/stable/nn.html?highlight=gru#torch.nn.GRU        
        # Initialize GRU; the input_size and hidden_size params are both set to 'hidden_size'
        #   because our input size is a word embedding with number of features == hidden_size
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers,
                          dropout=(0 if n_layers == 1 else dropout), bidirectional=True)

    def forward(self, input_seq, input_lengths, hidden=None):
        # input_seq: batch of input sentences; shape=(max_length, bathc_size)
        # input_lenghts: list of sentence lengths corresponding to each sentene in the batch
        # hidden state of shape: (n_layers x num_directions, batch_size, hidden_size )
        # num_directions = 2 because we are in the bidirectional
        # Convert word indexes to embeddings
        embedded = self.embedding(input_seq)
        # Pack padded batch of sequences for RNN module
        packed = nn.utils.rnn.pack_padded_sequence(embedded, input_lengths)
        # Forward pass through GRU
        outputs, hidden = self.gru(packed, hidden)
        # Unpack padding
        outputs, _ = nn.utils.rnn.pad_packed_sequence(outputs)
        # Sum bidirectional GRU outputs
        outputs = outputs[:, :, :self.hidden_size] + outputs[:, : ,self.hidden_size:]
        # Return output and final hidden state
        return outputs, hidden
    
        # outputs: the output features h_t from last layer of the GRU, for each timestep (sum of bidirectional outputs)
        # outputs shape=(max_lenght, batch-size,hidden_size)
        # hidden: hidden state for the last timestep, of shape=(n_layers x num_directions, batch_size, hidden_size)

In [55]:
a = torch.randn(5,4,3) # 5 channels, 4 rows, 3 columns
a                      # we gave 4 rows and 3 coluns in every channel 

tensor([[[ 0.5350, -3.2309, -0.7471],
         [ 0.0332, -2.2909,  0.7739],
         [-2.1880,  0.1024, -1.0335],
         [-1.2431,  0.8503, -2.1802]],

        [[-0.2679,  0.6516, -0.2425],
         [-0.3492,  0.4839, -1.0999],
         [-1.1370,  0.7864, -0.4613],
         [ 0.4408,  1.6251, -0.4104]],

        [[-0.3011, -0.2723,  0.7967],
         [-0.8307, -0.7087,  0.4237],
         [ 0.9820, -0.6979,  2.3737],
         [-0.2901, -1.2028,  1.1466]],

        [[-1.3417, -0.8210, -0.4455],
         [-1.4966,  0.9845,  1.2509],
         [-0.9394, -0.6506, -1.6857],
         [-0.7775, -0.1180, -0.7833]],

        [[ 2.2085, -0.3587,  0.2120],
         [ 1.3467, -1.0997, -0.6819],
         [-0.0721,  0.8381, -0.8725],
         [-1.3625,  1.8226,  0.3858]]])

In [56]:
a[:,:,:2] # we want to see all channels, all rows, and columns from 0 until 2 exclusive

tensor([[[ 0.5350, -3.2309],
         [ 0.0332, -2.2909],
         [-2.1880,  0.1024],
         [-1.2431,  0.8503]],

        [[-0.2679,  0.6516],
         [-0.3492,  0.4839],
         [-1.1370,  0.7864],
         [ 0.4408,  1.6251]],

        [[-0.3011, -0.2723],
         [-0.8307, -0.7087],
         [ 0.9820, -0.6979],
         [-0.2901, -1.2028]],

        [[-1.3417, -0.8210],
         [-1.4966,  0.9845],
         [-0.9394, -0.6506],
         [-0.7775, -0.1180]],

        [[ 2.2085, -0.3587],
         [ 1.3467, -1.0997],
         [-0.0721,  0.8381],
         [-1.3625,  1.8226]]])

### Understanding pack-padded_sequence
![](pad_padded_sequence.png)


## Decoder
The decoder RNN generates the response sentence in a token-by-token fashion. It uses the encoder’s context vectors, and internal hidden states to generate the next word in the sequence. We will use an "attention mechanism” that allows the decoder to pay attention to certain parts of the input sequence, rather than using the entire fixed context at every step. At a high level, attention is calculated using the decoder’s current hidden state and the encoder’s outputs. We will also use "Global attention", where we consider al of encoder's hidden states, as oppossed to "Local attention", which only considers the enconder's hidden state from the current time step, as well as calculate attention weights using the hidden staate of decoder fro the currtent time step only. The output of this attention module is a softmax normalized weights tensor of shape (batch_size, 1, max_length).

In Global Attention, there are various methods to calculate the attention energies between the enconder output and decoder output which are called "score functions": _ht_ is the current target decoder state and _hs_bar_ is the all enconder states.

![](attention_opperation.png)


## Attention Mechanism Diagram

![](attention_mech2.png)

In [57]:
#Luong attention layer
class Attn(nn.Module):
    def __init__(self, method, hidden_size):
        super(Attn, self).__init__()
        self.method = method
        self.hidden_size = hidden_size
        
    def dot_score(self, hidden, encoder_output):
        # Element-Wise Multiply the current target decoder state with the encoder output and sum them
        return torch.sum(hidden * encoder_output, dim=2)
    
    def forward(self, hidden, encoder_outputs):
        
        #It is the forward propagation.
        #It is how the Attention is calculated
        
        
        # hidden of shape: (1, batch_size, hidden_size) -> decoder output for Attention Mechanism
        # 1 because we are calculating one GRU per time step
        # encoder_outputs of shape: (max_length, batch_size, hidden_size)
        # (1, batch_size, hidden_size) * (max_length, batch_size, hidden_size) = (max_length, batch_size, hidden_size)
        
        # Calculate the attention weights(energies)
        attn_energies = self.dot_score(hidden, encoder_outputs)      # (max_length, batch_size)
        # Transpose max_length and batch-size dimensions
        attn_energies = attn_energies.t()                            # (batch_size, max_length)
        # return the softmax normalized probability scores (with added dimension)
        return F.softmax(attn_energies, dim=1).unsqueeze(1)          # (batch_size, 1, max_length)  
        #unsqueeze(1) add one dimension in the tensor (batch_size, "1", max_length)

In [58]:
# Only to explain the sum(dim=2) to understand the softmax(dim=1) is calculated for every row. 
#lenght size: 5 (channels)  ;    batch size : 3 (rows)     ; hidden size: 7 (columns)
import torch
a = torch.randn(5,3,7)
a 

tensor([[[ 1.1128,  0.3221, -0.8535,  1.8587, -0.4180,  0.4823, -0.1285],
         [-1.1753, -0.2309,  0.6065, -0.5781,  0.4912,  1.8851,  1.6029],
         [-0.2010, -0.5660,  2.0702, -0.6899,  0.2949, -0.9173, -0.6271]],

        [[-0.1248, -1.9668,  0.4997, -0.8772, -1.3751,  0.5139, -1.1419],
         [-0.6385,  0.8459, -0.5322, -0.2022,  0.2896, -0.6354,  1.0609],
         [ 0.2074,  0.0594, -0.5087,  1.3032, -0.7743,  0.9145, -0.1681]],

        [[ 0.0874,  1.5584, -0.9259,  1.7368, -0.7374, -0.5676,  1.3549],
         [ 2.2512, -0.5716,  0.5974, -0.8074, -0.1143, -0.5023,  0.4574],
         [-1.1242,  0.3946, -0.5153,  0.9565,  0.1177, -0.3228, -0.6198]],

        [[ 0.2402, -0.5952, -0.7769, -0.9336, -0.5321,  0.9421, -1.1277],
         [-1.0038,  0.4346, -0.2831, -1.7647,  1.9026, -0.2814,  0.4537],
         [ 0.0897, -1.1581, -0.1595,  0.8827,  1.1668, -1.3155,  0.9734]],

        [[ 0.3748, -0.1463, -1.4988,  0.2489, -1.3239,  1.0366,  0.2376],
         [-1.0118, -0.3550,  0

In [59]:
print(torch.sum(a,dim=2)) # sim across the columns

# first row
print(sum([ 0.0657, -0.4565,  0.7132, -3.2464,  0.4982,  0.6139,  1.4410]))

tensor([[ 2.3760,  2.6014, -0.6362],
        [-4.4722,  0.1882,  1.0334],
        [ 2.5066,  1.3104, -1.1133],
        [-2.7832, -0.5422,  0.4795],
        [-1.0712,  0.1812,  1.0792]])
-0.37089999999999956


In [60]:
# demonstrate the softmax
import torch.nn.functional as F
a = torch.rand(5,7)
a

tensor([[0.0756, 0.2722, 0.6306, 0.9543, 0.3732, 0.5185, 0.4947],
        [0.0587, 0.1478, 0.7291, 0.1109, 0.4348, 0.8934, 0.1815],
        [0.6971, 0.7416, 0.5465, 0.3413, 0.0294, 0.8073, 0.4629],
        [0.1306, 0.1105, 0.1725, 0.1664, 0.9383, 0.5485, 0.5198],
        [0.0624, 0.7113, 0.9893, 0.7340, 0.9771, 0.7764, 0.6723]])

In [61]:
 b = F.softmax(a,dim=1)

In [62]:
b[0].sum()

tensor(1.0000)

In [63]:
for i in range(0,5):
    print(b[i].sum())

tensor(1.0000)
tensor(1.0000)
tensor(1.)
tensor(1.)
tensor(1.)


In [64]:
new_list = list(range(0,5))
list(map(lambda x: print(b[x].sum()),new_list))

tensor(1.0000)
tensor(1.0000)
tensor(1.)
tensor(1.)
tensor(1.)


[None, None, None, None, None]

## Designing the Decoder

Now we are use our Attention to implement our Decoder

For the decoder, we will manually feed our batch one time step at a time. This means that our embedded word tensor and GRU output will both have shape (1, batch_size, hidden_size). The steps are:

       1. Get embedding of current input word.
       2. Forward through unidirectional GRU.
       3. Calculate attention weights from the current GRU output from (2).
       4. Multiply attention weights to encoder outputs to get new “weighted sum” context vector.
       5. Concatenate weighted context vector and GRU output using Luong eq. 5.
       6. Predict next word using Luong eq. 6 (without softmax).
       7. Return output and final hidden state.



- output of shape (seq_len,batch_num, num_directions*hidden_size)  -> go to pytorch documentation to see this information
-  seq_len = 1 becasuse we are calculating one GRU per time step
- num_direction = 2 because we are using bidirectional
- h_n is the hidden of your last GRU output. We are not working with a sequency of a GRU. We are working with one GRU at the time. The hidden state return by the GRU 14 *1 

In [65]:
class LuongAttnDecoderRNN(nn.Module):
    def __init__(self, attn_model, embedding, hidden_size, output_size, n_layers=1, dropout=0.1):
        super(LuongAttnDecoderRNN, self).__init__()

        # Keep for reference
        self.attn_model = attn_model
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout = dropout
        
        #Embeddings: The inputs of shape (1, batch_size) is only the indexes of the words 
        #we need to transform them to an embedding dense vector. This represents the feactures of the words.
        #We need to use it every time when we have words.       
        # Defining Layers: dropout reduces the overfitting
        
        # Define layers
        self.embedding = embedding
        self.embedding_dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden_size, hidden_size, n_layers, dropout=(0 if n_layers == 1 else dropout))
        self.concat = nn.Linear(hidden_size * 2, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)

        self.attn = Attn(attn_model, hidden_size)  #initialize the attention model inside of the class

    def forward(self, input_step, last_hidden, encoder_outputs):
        # input_step: one time step (one word) of input sequence batch; shpae=(1, batch_size)
        # last_hidden: final hidden state of encoder GRU; shape=(n_layers x num_directions, batch_size, hidden_size)
        # encoder_outputs: enconder model's output; shape=(seq_len, batch, num_directions x hidden_size)
        # Note: we run this one step (batch of words) at a time
        
        # Note: we run this one step (word) at a time
        # Get embedding of current input word
        embedded = self.embedding(input_step)                      # represents a word as features
        embedded = self.embedding_dropout(embedded)
        # Forward through unidirectional GRU
        rnn_output, hidden = self.gru(embedded, last_hidden)    # input step and the last hidden it is the encoder output
        # rnn_output of shape (seq_len, batch, num_directions * hidden_size)   
        # hidden of shape (num_layers * num_directions, batch, hidden_size)
        
        # Calculate attention weights from the current GRU output
        attn_weights = self.attn(rnn_output, encoder_outputs)
        
        # Multiply attention weights to enconder outputs to get new "weighted sum" context vector
        # (batch_size, 1 , max_lenght) bmm with (batch_size, max_length, hidden) = (batch_size,1, hidden)
        # context vector: shape(batch_size,1, hidden)
        # bmm -> batch multiply function to multiply 3D tensor
        #We see in the diagram
        #    - encoder output: shape(max_length, batch_size, hidden_size)
        #    - attention output: shape(batch_size,1,max_length)
        #So we are multiply (attn_weights with enconder_outputs) using bmm
        context = attn_weights.bmm(encoder_outputs.transpose(0, 1))
        
        # Concatenate weighted context vector and GRU output using Luong eq. 5
        rnn_output = rnn_output.squeeze(0) # we skip the first dimension of GRU output
                                           # from (1, batch_size, hidden_size) to (batch_size, hidden_size)
            
        context = context.squeeze(1)       # we skip the first dimension of context output
                                           # from (batch_size, 1, hidden_size) to (batch_size, hidden)
            
        concat_input = torch.cat((rnn_output, context), 1)               # concatenate both two along the columns
        concat_output = torch.tanh(self.concat(concat_input))
        # Predict next word using Luong eq. 6
        output = self.out(concat_output)    # the columns represents the distributions of the vocabulary words
        output = F.softmax(output, dim=1)
        # Return output and final hidden state
        return output, hidden
    
        # output: softmax normalized tensor giving proabilities of each word being the correct next word in 
        # the decoded sequence
        # shape = (batch_size, voc.num_words)
        # hidden: final hidden state of GRU; shpae=(n_layers x num_directions, batch_size, hidden_size)

### We're done with building the Architecture. Now Let's move on to the Training code

## Creating the Loss Function

Since we are dealing with batches of padded sequences, we cannot simply consider all elements of the tensor when calculating loss. We define maskNLLLoss to calculate our loss based on our decoder's output tensor, the target tensor, and a binary mask tensor describing the padding of the targer tensor. This loss function calculates the average negative log likelihood of the lements that correspond to a 1 in the mask tensor.


In [66]:
def maskNLLLoss(decoder_out, target, mask):  
    
    """
    NLLLoss : Negative LikeLihood Loss
    The mask is returned by 'batch2TrainData' function
    """
    
    nTotal = mask.sum()        # How many elements should we consider
    target = target.view(-1,1) # because we use torch.gather.
                               # -1 python decides the dimension-> in this case is batch_size, 1 dimension=1
        
    # decoder_out shape: (batch_size, vocab_size), target_size = (batch_size, 1)
    gathered_tensor = torch.gather(decoder_out, 1, target)
    # calculate the Negative Log Likelihood Loss
    crossEntropy = -torch.log(gathered_tensor)
    # Select the non-zero elements
    loss = crossEntropy.masked_select(mask) # according with this mask it tasks 
                                            # the correspond element of thhe crossEntropy tensor
    #calculate the mean of the Loss
    loss = loss.mean()
    loss = loss.to(device)                  # transform to the CUDA otherwise transform to the GPU
    return loss, nTotal.item()

In [67]:
# Visualize what's hapening in the Loss Function
# decoder_out shape: (batch_size, vocab_size), target_size = (batch_size,1)
dec_o = torch.rand(5,7)
dec_o = F.softmax(dec_o, dim=1)
tar = torch.tensor([2, 1, 5, 4, 0], dtype = torch.long)
tar = tar.view(-1,1)
mask = torch.tensor([1, 0, 1, 1, 0], dtype = torch.uint8)
print(dec_o)
print(tar)
gath_ten = torch.gather(dec_o, 1 ,tar) # Get the softmax scores for the expected correct predictions
print(gath_ten)
print(gath_ten.shape)
crossEntropy= - torch.log(gath_ten)
print("Cross Entropy:")
print(crossEntropy)
mask = mask.unsqueeze(1)
loss = crossEntropy.masked_select(mask)
print("Loss:")
print(loss)
print(loss.shape)
print("Sum of mask elements (How many elements we are considering): ", mask.sum())
print("Mean of the Loss: ", loss.mean())
print("Mean of the cross-entropy loss (without masking):", crossEntropy.mean())

tensor([[0.0873, 0.2221, 0.1112, 0.1786, 0.1105, 0.1649, 0.1254],
        [0.1110, 0.1294, 0.2256, 0.1011, 0.1242, 0.1404, 0.1682],
        [0.1451, 0.1179, 0.1034, 0.2115, 0.0959, 0.1764, 0.1497],
        [0.1675, 0.0988, 0.2395, 0.1084, 0.1264, 0.1477, 0.1116],
        [0.1095, 0.1660, 0.1252, 0.1935, 0.1237, 0.1825, 0.0995]])
tensor([[2],
        [1],
        [5],
        [4],
        [0]])
tensor([[0.1112],
        [0.1294],
        [0.1764],
        [0.1264],
        [0.1095]])
torch.Size([5, 1])
Cross Entropy:
tensor([[2.1961],
        [2.0445],
        [1.7350],
        [2.0686],
        [2.2118]])
Loss:
tensor([2.1961, 1.7350, 2.0686])
torch.Size([3])
Sum of mask elements (How many elements we are considering):  tensor(3)
Mean of the Loss:  tensor(1.9999)
Mean of the cross-entropy loss (without masking): tensor(2.0512)


## Teaching Forcing 

No Teaching Force: If the a word was wrong generated it will mess the all predicted words in the next time steps

Teaching Force: If a word was wrong generated  the method is safety by the fact the word that is used as input for the enxt time step is not the output of the previous time step but is the label of the correct word.

In this project we will use No Teaching Forcing in 50% of the time and Teaching Force in another 50% of the time. 

![](teach_force.png)


We will use **Teaching Forcing** in Training. This means that at some probability, set by teacher_forcing-ratio, we use the current target word  as the decoder’s next input rather than using the decoder’s current guess. This technique acts as training wheels for the decoder, aiding in more efficient training. However, teacher forcing can lead to model instability during inference, as the decoder may not have a sufficient chance to truly craft its own output sequences during training. Thus, we must be mindful of how we are setting the teacher_forcing_ratio, and not be fooled by fast convergence. The second trick that we implement is **Gradient Clipping**. This is a commonly used technique for countering the “exploding gradient” problem. In essence, by clipping or thresholding gradients to a maximum value, we prevent the gradients from growing exponentially and either overflow (NaN), or overshoot steep cliffs in the cost function.

**Sequence of Operations in Training:**

        1. Forward pass entire input batch through encoder.
        2. Initialize decoder inputs as SOS_token, and hidden state as the encoder’s final hidden state.
        3. Forward input batch sequence through decoder one time step at a time.
        4. If teacher forcing: set next decoder input as the current target; 
             else: set next decoder input as current decoder output.
        5. Calculate and accumulate loss.
        6. Perform backpropagation.
        7. Clip gradients.
        8. Update encoder and decoder model parameters.
        
Before move on to Training, let's see a live training and waht's happening with the data: 

In [69]:
# Visualizing what's happening in one iteration. Only rus this for visualization.
small_batch_size = 5
batches = batch2TrainData(voc, [random.choice(pairs) for _ in range(small_batch_size)])

input_variable, lengths, target_variable, mask, max_target_len = batches

print("input_variable shape:", input_variable.shape)
print("lengts shape:", lengths.shape)
print("target_variable shape:", target_variable.shape)
print("mask shape:", mask.shape)
print("max_target_len  shape:", max_target_len)

# define the parameters
hidden_size= 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
attn_model = 'dot'
embedding = nn.Embedding(voc.num_words, hidden_size)

#Define the encoder and Decoder
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
encoder = encoder.to(device)
decoder = decoder.to(device)



# Ensure dorpout Layers are in train mode
encoder.train()
decoder.train()

# initialize optimizers
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.0001)  # gradient descendent approach: Adam optimization
decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.0001)  # lr: learning rate good initiate with low value
encoder_optimizer.zero_grad()                                    # we dont want stuck in the zero gradient
decoder_optimizer.zero_grad()

input_variable = input_variable.to(device)
lengths = lengths.to(device)
target_variable = target_variable.to(device)
mask = mask.to(device)

loss = 0
print_losses = []   # initialize an empty list to store the losses
n_totals = 0

encoder_outputs, encoder_hidden = encoder(input_variable, lengths)
print("Encoder Outputs Shape:", encoder_outputs.shape)
print("Last Encoder Hidden Shape", encoder_hidden.shape)

decoder_input = torch.LongTensor([[SOS_token for _ in range(small_batch_size)]])
decoder_input = decoder_input.to(device)
print("Initial decoder Input Shape:", decoder_input.shape)
print(decoder_input)

#Set initial decoder hidden state to the encoder's finl hidden state
decoder_hidden = encoder_hidden[:decoder.n_layers]
print("Initial Decoder hidden state shape:", decoder_hidden.shape)
print("\n")
print("---------------------------------------------------------------------")
print("Now Let's look what's happening in every tiimestep of the GRU")
print("---------------------------------------------------------------------")
print("\n")

#Assume we are using teacher Forcing
"""
How determine the number of time steps of decoder? 
We take the process dataset and look the sentence that have the maximum length.
So, how many words this sentence have is basically the number of timesteps.
The maximum timesteps is the maximum words you have.
"""

for t in range(max_target_len):
    decoder_output, decoder_hidden = decoder(decoder_input, decoder_hidden, encoder_outputs)
    print("Decoder Output shape:", decoder_output.shape)
    print("Decoder Hidden Shape:", decoder_hidden.shape)
    # Teacher Forcing: next input is current target
    decoder_input = target_variable[t].view(1,-1)
    print("The target varaible at the current timestep before reshaping:", target_variable[t])
    print("The target variable at the current timestep shape before reshaping:", target_variable[t].shape)
    print("The Decoder input shape (reshape the target variable):", decoder_input.shape)
    #calculate and accumulate Loss
    print("The mask of the current timestep:", mask[t])
    print("The mask at the current timestep shape:", mask[t].shape)
    mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
    print("Mask loss:", mask_loss)
    print("Total:", nTotal)
    loss += mask_loss
    print_losses.append(mask_loss.item() * nTotal)
    print(print_losses)
    n_totals += nTotal
    print(n_totals)
    encoder_optimizer.step()                          # updating the weights
    decoder_optimizer.step()                          # updating the weights
    returned_loss = sum(print_losses) / n_totals
    print("Returned Loss:", returned_loss)
    print("\n")
    print("---------------------- DONE ONE TIMESTEP ----------------------")
    print("\n")
    


input_variable shape: torch.Size([8, 5])
lengts shape: torch.Size([5])
target_variable shape: torch.Size([10, 5])
mask shape: torch.Size([10, 5])
max_target_len  shape: 10
Encoder Outputs Shape: torch.Size([8, 5, 500])
Last Encoder Hidden Shape torch.Size([4, 5, 500])
Initial decoder Input Shape: torch.Size([1, 5])
tensor([[1, 1, 1, 1, 1]], device='cuda:0')
Initial Decoder hidden state shape: torch.Size([2, 5, 500])


---------------------------------------------------------------------
Now Let's look what's happening in every tiimestep of the GRU
---------------------------------------------------------------------


Decoder Output shape: torch.Size([5, 7826])
Decoder Hidden Shape: torch.Size([2, 5, 500])
The target varaible at the current timestep before reshaping: tensor([651, 348, 597,  35, 660], device='cuda:0')
The target variable at the current timestep shape before reshaping: torch.Size([5])
The Decoder input shape (reshape the target variable): torch.Size([1, 5])
The mask of t

In [87]:
def train(input_variable, lengths, target_variable, mask, max_target_len, encoder, decoder, embedding,
          encoder_optimizer, decoder_optimizer, batch_size, clip, max_length=MAX_LENGTH):

    # Zero gradients
    # we dont want stuck in the zero gradient
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Set device options
    input_variable = input_variable.to(device)
    lengths = lengths.to(device)
    target_variable = target_variable.to(device)
    mask = mask.to(device)

    # Initialize variables
    loss = 0
    print_losses = []
    n_totals = 0

    # Forward pass through encoder
    encoder_outputs, encoder_hidden = encoder(input_variable, lengths)

    # Create initial decoder input (start with SOS tokens for each sentence)
    decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
    decoder_input = decoder_input.to(device)

    # Set initial decoder hidden state to the encoder's final hidden state
    decoder_hidden = encoder_hidden[:decoder.n_layers]

    # Determine if we are using teacher forcing this iteration
    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    # Forward batch of sequences through decoder one time step at a time
    if use_teacher_forcing:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # Teacher forcing: next input is current target
            decoder_input = target_variable[t].view(1, -1)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal
    else:
        for t in range(max_target_len):
            decoder_output, decoder_hidden = decoder(
                decoder_input, decoder_hidden, encoder_outputs
            )
            # No teacher forcing: next input is decoder's own current output
            _, topi = decoder_output.topk(1)
            decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
            decoder_input = decoder_input.to(device)
            # Calculate and accumulate loss
            mask_loss, nTotal = maskNLLLoss(decoder_output, target_variable[t], mask[t])
            loss += mask_loss
            print_losses.append(mask_loss.item() * nTotal)
            n_totals += nTotal

    # Perform backpropatation
    loss.backward()

    # Clip gradients: gradients are modified in place
    _ = nn.utils.clip_grad_norm_(encoder.parameters(), clip)
    _ = nn.utils.clip_grad_norm_(decoder.parameters(), clip)

    # Adjust model weights
    encoder_optimizer.step()
    decoder_optimizer.step()

    return sum(print_losses) / n_totals

We save a tarball containing the encoder and decoder state_dicts (parameters), the optimizers’ state_dicts, the loss, the iteration, etc. Saving the model in this way will give us the ultimate flexibility with the checkpoint. After loading a checkpoint, we will be able to use the model parameters to run inference, or we can continue training right where we left off.

In [88]:
def trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer, embedding, 
               encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size, print_every, 
               save_every, clip, corpus_name, loadFilename):

    # Load batches for each iteration
    training_batches = [batch2TrainData(voc, [random.choice(pairs) for _ in range(batch_size)])
                      for _ in range(n_iteration)]

    # Initializations
    print('Initializing ...')
    start_iteration = 1
    print_loss = 0
    if loadFilename:
        start_iteration = checkpoint['iteration'] + 1

    # Training loop
    print("Training...")
    for iteration in range(start_iteration, n_iteration + 1):
        training_batch = training_batches[iteration - 1]
        # Extract fields from batch
        input_variable, lengths, target_variable, mask, max_target_len = training_batch

        # Run a training iteration with batch
        loss = train(input_variable, lengths, target_variable, mask, max_target_len, encoder,
                     decoder, embedding, encoder_optimizer, decoder_optimizer, batch_size, clip)
        print_loss += loss

        # Print progress
        if iteration % print_every == 0:
            print_loss_avg = print_loss / print_every
            print("Iteration: {}; Percent complete: {:.1f}%; Average loss: {:.4f}".format(iteration, iteration / n_iteration * 100, print_loss_avg))
            print_loss = 0

        # Save checkpoint
        if (iteration % save_every == 0):
            directory = os.path.join(save_dir, model_name, corpus_name, '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size))
            if not os.path.exists(directory):
                os.makedirs(directory)
            torch.save({
                'iteration': iteration,
                'en': encoder.state_dict(),
                'de': decoder.state_dict(),
                'en_opt': encoder_optimizer.state_dict(),
                'de_opt': decoder_optimizer.state_dict(),
                'loss': loss,
                'voc_dict': voc.__dict__,
                'embedding': embedding.state_dict()
            }, os.path.join(directory, '{}_{}.tar'.format(iteration, 'checkpoint')))

## Define Evaluation
After training a model, we want to be able to talk to the bot ourselves. First, we must define how we want the model to decode the encoded input.

With the **Greedy decoding** we simply choose the word from decoder_output with the highest softmax value. This decoding method is optimal on a single time-step level.

The input sentence is evaluated using the following computational graph:

Computation Graph:

        1. Forward input through encoder model.
        2. Prepare encoder’s final hidden layer to be first hidden input to the decoder.
        3. Initialize decoder’s first input as SOS_token.
        4. Initialize tensors to append decoded words to.
        5. Iteratively decode one word token at a time:
                5.1 Forward pass through decoder.
                5.2 Obtain most likely word token and its softmax score.
                5.3 Record token and score.
                5.4 Prepare current token to be next decoder input.
        6. Return collections of word tokens and scores.


The class _GreedySearchDecoder_ takes an 
    - an input sequence (input_seq) of shape (input_seq length, 1)
    - a scalar input length (input_length) tensor, 
    - a max_length to bound the response sentence length. 

In [89]:
class GreedySearchDecoder(nn.Module):
    def __init__(self, encoder, decoder):
        super(GreedySearchDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, input_seq, input_length, max_length):
        # Forward input through encoder model
        encoder_outputs, encoder_hidden = self.encoder(input_seq, input_length)
        # Prepare encoder's final hidden layer to be first hidden input to the decoder
        decoder_hidden = encoder_hidden[:decoder.n_layers]
        # Initialize decoder input with SOS_token
        decoder_input = torch.ones(1, 1, device=device, dtype=torch.long) * SOS_token
        # Initialize tensors to append decoded words to
        all_tokens = torch.zeros([0], device=device, dtype=torch.long)
        all_scores = torch.zeros([0], device=device)
        # Iteratively decode one word token at a time
        for _ in range(max_length):
            # Forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden, encoder_outputs)
            # Obtain most likely word token and its softmax score
            decoder_scores, decoder_input = torch.max(decoder_output, dim=1)
            # Record token and score
            all_tokens = torch.cat((all_tokens, decoder_input), dim=0)
            all_scores = torch.cat((all_scores, decoder_scores), dim=0)
            # Prepare current token to be next decoder input (add a dimension)
            decoder_input = torch.unsqueeze(decoder_input, 0)
        # Return collections of word tokens and scores
        return all_tokens, all_scores

## Evaluate my text

We first format the sentence as an input batch of word indexes with _batch_size==1_. We do this by converting the words of the sentence to their corresponding indexes, and transposing the dimensions to prepare the tensor for our models. 

We also create a _lengths_ tensor which contains the length of our input sentence. In this case, lengths is scalar because we are only evaluating one sentence at a time (batch_size==1). 

Next, we obtain the decoded response sentence tensor using our _GreedySearchDecoder_ object (searcher). Finally, we convert the response’s indexes to words and return the list of decoded words.

In [90]:
def evaluate(encoder, decoder, searcher, voc, sentence, max_length=MAX_LENGTH):
    ### Format input sentence as a batch
    # words -> indexes
    indexes_batch = [indexesFromSentence(voc, sentence)]
    # Create lengths tensor
    lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
    # Transpose dimensions of batch to match models' expectations
    input_batch = torch.LongTensor(indexes_batch).transpose(0, 1)
    # Use appropriate device
    input_batch = input_batch.to(device)
    lengths = lengths.to(device)
    # Decode sentence with searcher
    tokens, scores = searcher(input_batch, lengths, max_length)
    # indexes -> words
    decoded_words = [voc.index2word[token.item()] for token in tokens]
    return decoded_words


def evaluateInput(encoder, decoder, searcher, voc):
    input_sentence = ''
    while(1):
        try:
            # Get input sentence
            input_sentence = input('> ')
            # Check if it is quit case
            if input_sentence == 'q' or input_sentence == 'quit': break
            # Normalize sentence
            input_sentence = normalizeString(input_sentence)
            # Evaluate sentence
            output_words = evaluate(encoder, decoder, searcher, voc, input_sentence)
            # Format and print response sentence
            output_words[:] = [x for x in output_words if not (x == 'EOS' or x == 'PAD')]
            print('Bot:', ' '.join(output_words))

        except KeyError:
            print("Error: Encountered unknown word.")

## Run Model

We must initialize the individual encoder and decoder models. In the following block, we set our desired configurations, choose to start from scratch or set a checkpoint to load from, and build and initialize the models. Feel free to play with different model configurations to optimize performance.

In [91]:
# Configure models
model_name = 'cb_model'
attn_model = 'dot'
#attn_model = 'general'
#attn_model = 'concat'
hidden_size = 500
encoder_n_layers = 2
decoder_n_layers = 2
dropout = 0.1
batch_size = 64

# Set checkpoint to load from; set to None if starting from scratch
loadFilename = None
checkpoint_iter = 4000
#loadFilename = os.path.join(save_dir, model_name, corpus_name,
#                            '{}-{}_{}'.format(encoder_n_layers, decoder_n_layers, hidden_size),
#                            '{}_checkpoint.tar'.format(checkpoint_iter))


# Load model if a loadFilename is provided
if loadFilename:
    # If loading on same machine the model was trained on
    checkpoint = torch.load(loadFilename)
    # If loading a model trained on GPU to CPU
    #checkpoint = torch.load(loadFilename, map_location=torch.device('cpu'))
    encoder_sd = checkpoint['en']
    decoder_sd = checkpoint['de']
    encoder_optimizer_sd = checkpoint['en_opt']
    decoder_optimizer_sd = checkpoint['de_opt']
    embedding_sd = checkpoint['embedding']
    voc.__dict__ = checkpoint['voc_dict']


print('Building encoder and decoder ...')
# Initialize word embeddings
embedding = nn.Embedding(voc.num_words, hidden_size)
if loadFilename:
    embedding.load_state_dict(embedding_sd)
# Initialize encoder & decoder models
encoder = EncoderRNN(hidden_size, embedding, encoder_n_layers, dropout)
decoder = LuongAttnDecoderRNN(attn_model, embedding, hidden_size, voc.num_words, decoder_n_layers, dropout)
if loadFilename:
    encoder.load_state_dict(encoder_sd)
    decoder.load_state_dict(decoder_sd)
# Use appropriate device
encoder = encoder.to(device)
decoder = decoder.to(device)
print('Models built and ready to go!')

Building encoder and decoder ...
Models built and ready to go!


## Run Training
Run the following block if you want to train the model.

In [92]:
# Configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 1
save_every = 500

# Ensure dropout layers are in train mode
encoder.train()
decoder.train()

# Initialize optimizers
print('Building optimizers ...')
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * decoder_learning_ratio)
if loadFilename:
    encoder_optimizer.load_state_dict(encoder_optimizer_sd)
    decoder_optimizer.load_state_dict(decoder_optimizer_sd)

# Run training iterations
print("Starting Training!")
corpus_name="cornell movie-dialogs corpus"
trainIters(model_name, voc, pairs, encoder, decoder, encoder_optimizer, decoder_optimizer,
           embedding, encoder_n_layers, decoder_n_layers, save_dir, n_iteration, batch_size,
           print_every, save_every, clip, corpus_name, loadFilename)

Building optimizers ...
Starting Training!
Initializing ...
Training...
Iteration: 1; Percent complete: 0.0%; Average loss: 8.9640
Iteration: 2; Percent complete: 0.1%; Average loss: 8.7794
Iteration: 3; Percent complete: 0.1%; Average loss: 8.5129
Iteration: 4; Percent complete: 0.1%; Average loss: 8.0858
Iteration: 5; Percent complete: 0.1%; Average loss: 7.7061
Iteration: 6; Percent complete: 0.1%; Average loss: 7.0297
Iteration: 7; Percent complete: 0.2%; Average loss: 6.7559
Iteration: 8; Percent complete: 0.2%; Average loss: 6.7653
Iteration: 9; Percent complete: 0.2%; Average loss: 6.6249
Iteration: 10; Percent complete: 0.2%; Average loss: 6.3392
Iteration: 11; Percent complete: 0.3%; Average loss: 5.8704
Iteration: 12; Percent complete: 0.3%; Average loss: 5.5682
Iteration: 13; Percent complete: 0.3%; Average loss: 5.4669
Iteration: 14; Percent complete: 0.4%; Average loss: 5.1234
Iteration: 15; Percent complete: 0.4%; Average loss: 5.1026
Iteration: 16; Percent complete: 0.4%

Iteration: 136; Percent complete: 3.4%; Average loss: 3.8623
Iteration: 137; Percent complete: 3.4%; Average loss: 3.6680
Iteration: 138; Percent complete: 3.5%; Average loss: 3.8641
Iteration: 139; Percent complete: 3.5%; Average loss: 3.7788
Iteration: 140; Percent complete: 3.5%; Average loss: 3.7097
Iteration: 141; Percent complete: 3.5%; Average loss: 3.6508
Iteration: 142; Percent complete: 3.5%; Average loss: 3.4200
Iteration: 143; Percent complete: 3.6%; Average loss: 3.7099
Iteration: 144; Percent complete: 3.6%; Average loss: 3.6190
Iteration: 145; Percent complete: 3.6%; Average loss: 3.7419
Iteration: 146; Percent complete: 3.6%; Average loss: 3.6353
Iteration: 147; Percent complete: 3.7%; Average loss: 3.6859
Iteration: 148; Percent complete: 3.7%; Average loss: 3.6312
Iteration: 149; Percent complete: 3.7%; Average loss: 3.4630
Iteration: 150; Percent complete: 3.8%; Average loss: 3.5866
Iteration: 151; Percent complete: 3.8%; Average loss: 3.3943
Iteration: 152; Percent 

Iteration: 271; Percent complete: 6.8%; Average loss: 3.4805
Iteration: 272; Percent complete: 6.8%; Average loss: 3.4292
Iteration: 273; Percent complete: 6.8%; Average loss: 3.3239
Iteration: 274; Percent complete: 6.9%; Average loss: 3.4581
Iteration: 275; Percent complete: 6.9%; Average loss: 3.5567
Iteration: 276; Percent complete: 6.9%; Average loss: 3.2917
Iteration: 277; Percent complete: 6.9%; Average loss: 3.5348
Iteration: 278; Percent complete: 7.0%; Average loss: 3.2210
Iteration: 279; Percent complete: 7.0%; Average loss: 3.3123
Iteration: 280; Percent complete: 7.0%; Average loss: 3.4997
Iteration: 281; Percent complete: 7.0%; Average loss: 3.1684
Iteration: 282; Percent complete: 7.0%; Average loss: 3.2598
Iteration: 283; Percent complete: 7.1%; Average loss: 3.3245
Iteration: 284; Percent complete: 7.1%; Average loss: 3.2866
Iteration: 285; Percent complete: 7.1%; Average loss: 3.4350
Iteration: 286; Percent complete: 7.1%; Average loss: 3.2365
Iteration: 287; Percent 

Iteration: 406; Percent complete: 10.2%; Average loss: 3.2292
Iteration: 407; Percent complete: 10.2%; Average loss: 3.4649
Iteration: 408; Percent complete: 10.2%; Average loss: 3.4306
Iteration: 409; Percent complete: 10.2%; Average loss: 3.3615
Iteration: 410; Percent complete: 10.2%; Average loss: 3.7952
Iteration: 411; Percent complete: 10.3%; Average loss: 3.2507
Iteration: 412; Percent complete: 10.3%; Average loss: 3.3085
Iteration: 413; Percent complete: 10.3%; Average loss: 3.3951
Iteration: 414; Percent complete: 10.3%; Average loss: 3.2813
Iteration: 415; Percent complete: 10.4%; Average loss: 3.4904
Iteration: 416; Percent complete: 10.4%; Average loss: 3.2312
Iteration: 417; Percent complete: 10.4%; Average loss: 3.4664
Iteration: 418; Percent complete: 10.4%; Average loss: 3.4087
Iteration: 419; Percent complete: 10.5%; Average loss: 3.3325
Iteration: 420; Percent complete: 10.5%; Average loss: 3.1153
Iteration: 421; Percent complete: 10.5%; Average loss: 3.2943
Iteratio

Iteration: 539; Percent complete: 13.5%; Average loss: 3.2264
Iteration: 540; Percent complete: 13.5%; Average loss: 3.0807
Iteration: 541; Percent complete: 13.5%; Average loss: 3.1373
Iteration: 542; Percent complete: 13.6%; Average loss: 3.1240
Iteration: 543; Percent complete: 13.6%; Average loss: 3.1302
Iteration: 544; Percent complete: 13.6%; Average loss: 2.9836
Iteration: 545; Percent complete: 13.6%; Average loss: 3.0697
Iteration: 546; Percent complete: 13.7%; Average loss: 3.1523
Iteration: 547; Percent complete: 13.7%; Average loss: 3.1709
Iteration: 548; Percent complete: 13.7%; Average loss: 2.9833
Iteration: 549; Percent complete: 13.7%; Average loss: 3.0696
Iteration: 550; Percent complete: 13.8%; Average loss: 3.2955
Iteration: 551; Percent complete: 13.8%; Average loss: 3.1037
Iteration: 552; Percent complete: 13.8%; Average loss: 2.9540
Iteration: 553; Percent complete: 13.8%; Average loss: 2.9926
Iteration: 554; Percent complete: 13.9%; Average loss: 3.3073
Iteratio

Iteration: 672; Percent complete: 16.8%; Average loss: 3.1258
Iteration: 673; Percent complete: 16.8%; Average loss: 3.3692
Iteration: 674; Percent complete: 16.9%; Average loss: 3.0881
Iteration: 675; Percent complete: 16.9%; Average loss: 3.1288
Iteration: 676; Percent complete: 16.9%; Average loss: 2.9935
Iteration: 677; Percent complete: 16.9%; Average loss: 3.2272
Iteration: 678; Percent complete: 17.0%; Average loss: 3.1728
Iteration: 679; Percent complete: 17.0%; Average loss: 3.1486
Iteration: 680; Percent complete: 17.0%; Average loss: 3.1509
Iteration: 681; Percent complete: 17.0%; Average loss: 2.8876
Iteration: 682; Percent complete: 17.1%; Average loss: 3.1513
Iteration: 683; Percent complete: 17.1%; Average loss: 3.1485
Iteration: 684; Percent complete: 17.1%; Average loss: 3.1881
Iteration: 685; Percent complete: 17.1%; Average loss: 3.3922
Iteration: 686; Percent complete: 17.2%; Average loss: 3.0901
Iteration: 687; Percent complete: 17.2%; Average loss: 3.0614
Iteratio

Iteration: 805; Percent complete: 20.1%; Average loss: 3.5480
Iteration: 806; Percent complete: 20.2%; Average loss: 3.0097
Iteration: 807; Percent complete: 20.2%; Average loss: 2.9839
Iteration: 808; Percent complete: 20.2%; Average loss: 3.1696
Iteration: 809; Percent complete: 20.2%; Average loss: 3.2129
Iteration: 810; Percent complete: 20.2%; Average loss: 3.0666
Iteration: 811; Percent complete: 20.3%; Average loss: 2.9452
Iteration: 812; Percent complete: 20.3%; Average loss: 3.0707
Iteration: 813; Percent complete: 20.3%; Average loss: 2.8430
Iteration: 814; Percent complete: 20.3%; Average loss: 2.8643
Iteration: 815; Percent complete: 20.4%; Average loss: 2.7258
Iteration: 816; Percent complete: 20.4%; Average loss: 3.0063
Iteration: 817; Percent complete: 20.4%; Average loss: 3.0743
Iteration: 818; Percent complete: 20.4%; Average loss: 3.0147
Iteration: 819; Percent complete: 20.5%; Average loss: 2.9189
Iteration: 820; Percent complete: 20.5%; Average loss: 2.8129
Iteratio

Iteration: 938; Percent complete: 23.4%; Average loss: 3.1371
Iteration: 939; Percent complete: 23.5%; Average loss: 3.0894
Iteration: 940; Percent complete: 23.5%; Average loss: 2.7934
Iteration: 941; Percent complete: 23.5%; Average loss: 2.8677
Iteration: 942; Percent complete: 23.5%; Average loss: 3.0629
Iteration: 943; Percent complete: 23.6%; Average loss: 3.1040
Iteration: 944; Percent complete: 23.6%; Average loss: 3.0144
Iteration: 945; Percent complete: 23.6%; Average loss: 2.9473
Iteration: 946; Percent complete: 23.6%; Average loss: 2.8157
Iteration: 947; Percent complete: 23.7%; Average loss: 3.2349
Iteration: 948; Percent complete: 23.7%; Average loss: 2.9927
Iteration: 949; Percent complete: 23.7%; Average loss: 3.3079
Iteration: 950; Percent complete: 23.8%; Average loss: 2.6790
Iteration: 951; Percent complete: 23.8%; Average loss: 3.1629
Iteration: 952; Percent complete: 23.8%; Average loss: 2.9569
Iteration: 953; Percent complete: 23.8%; Average loss: 3.0938
Iteratio

Iteration: 1070; Percent complete: 26.8%; Average loss: 3.0362
Iteration: 1071; Percent complete: 26.8%; Average loss: 3.0671
Iteration: 1072; Percent complete: 26.8%; Average loss: 2.7761
Iteration: 1073; Percent complete: 26.8%; Average loss: 2.7654
Iteration: 1074; Percent complete: 26.9%; Average loss: 3.1021
Iteration: 1075; Percent complete: 26.9%; Average loss: 2.5134
Iteration: 1076; Percent complete: 26.9%; Average loss: 3.0114
Iteration: 1077; Percent complete: 26.9%; Average loss: 2.9207
Iteration: 1078; Percent complete: 27.0%; Average loss: 2.9558
Iteration: 1079; Percent complete: 27.0%; Average loss: 3.0198
Iteration: 1080; Percent complete: 27.0%; Average loss: 2.7673
Iteration: 1081; Percent complete: 27.0%; Average loss: 2.8356
Iteration: 1082; Percent complete: 27.1%; Average loss: 2.8460
Iteration: 1083; Percent complete: 27.1%; Average loss: 2.8537
Iteration: 1084; Percent complete: 27.1%; Average loss: 2.9850
Iteration: 1085; Percent complete: 27.1%; Average loss:

Iteration: 1201; Percent complete: 30.0%; Average loss: 3.0770
Iteration: 1202; Percent complete: 30.0%; Average loss: 2.9210
Iteration: 1203; Percent complete: 30.1%; Average loss: 2.6683
Iteration: 1204; Percent complete: 30.1%; Average loss: 2.6921
Iteration: 1205; Percent complete: 30.1%; Average loss: 2.5982
Iteration: 1206; Percent complete: 30.1%; Average loss: 2.9490
Iteration: 1207; Percent complete: 30.2%; Average loss: 2.8411
Iteration: 1208; Percent complete: 30.2%; Average loss: 3.0329
Iteration: 1209; Percent complete: 30.2%; Average loss: 3.1408
Iteration: 1210; Percent complete: 30.2%; Average loss: 2.7137
Iteration: 1211; Percent complete: 30.3%; Average loss: 2.8412
Iteration: 1212; Percent complete: 30.3%; Average loss: 2.8997
Iteration: 1213; Percent complete: 30.3%; Average loss: 2.8213
Iteration: 1214; Percent complete: 30.3%; Average loss: 2.9002
Iteration: 1215; Percent complete: 30.4%; Average loss: 3.1523
Iteration: 1216; Percent complete: 30.4%; Average loss:

Iteration: 1332; Percent complete: 33.3%; Average loss: 2.7372
Iteration: 1333; Percent complete: 33.3%; Average loss: 2.7746
Iteration: 1334; Percent complete: 33.4%; Average loss: 2.6241
Iteration: 1335; Percent complete: 33.4%; Average loss: 3.1221
Iteration: 1336; Percent complete: 33.4%; Average loss: 3.0360
Iteration: 1337; Percent complete: 33.4%; Average loss: 3.0710
Iteration: 1338; Percent complete: 33.5%; Average loss: 2.8885
Iteration: 1339; Percent complete: 33.5%; Average loss: 2.6843
Iteration: 1340; Percent complete: 33.5%; Average loss: 2.9558
Iteration: 1341; Percent complete: 33.5%; Average loss: 2.7614
Iteration: 1342; Percent complete: 33.6%; Average loss: 2.7893
Iteration: 1343; Percent complete: 33.6%; Average loss: 2.7367
Iteration: 1344; Percent complete: 33.6%; Average loss: 2.5633
Iteration: 1345; Percent complete: 33.6%; Average loss: 2.6907
Iteration: 1346; Percent complete: 33.7%; Average loss: 2.7314
Iteration: 1347; Percent complete: 33.7%; Average loss:

Iteration: 1463; Percent complete: 36.6%; Average loss: 2.7539
Iteration: 1464; Percent complete: 36.6%; Average loss: 2.8804
Iteration: 1465; Percent complete: 36.6%; Average loss: 2.9816
Iteration: 1466; Percent complete: 36.6%; Average loss: 2.7211
Iteration: 1467; Percent complete: 36.7%; Average loss: 2.8465
Iteration: 1468; Percent complete: 36.7%; Average loss: 2.6710
Iteration: 1469; Percent complete: 36.7%; Average loss: 2.7955
Iteration: 1470; Percent complete: 36.8%; Average loss: 3.0939
Iteration: 1471; Percent complete: 36.8%; Average loss: 2.6691
Iteration: 1472; Percent complete: 36.8%; Average loss: 2.8101
Iteration: 1473; Percent complete: 36.8%; Average loss: 2.7013
Iteration: 1474; Percent complete: 36.9%; Average loss: 2.8204
Iteration: 1475; Percent complete: 36.9%; Average loss: 2.8523
Iteration: 1476; Percent complete: 36.9%; Average loss: 2.7669
Iteration: 1477; Percent complete: 36.9%; Average loss: 2.6200
Iteration: 1478; Percent complete: 37.0%; Average loss:

Iteration: 1594; Percent complete: 39.9%; Average loss: 2.7275
Iteration: 1595; Percent complete: 39.9%; Average loss: 2.8901
Iteration: 1596; Percent complete: 39.9%; Average loss: 2.7442
Iteration: 1597; Percent complete: 39.9%; Average loss: 2.6442
Iteration: 1598; Percent complete: 40.0%; Average loss: 2.5766
Iteration: 1599; Percent complete: 40.0%; Average loss: 2.7670
Iteration: 1600; Percent complete: 40.0%; Average loss: 2.7904
Iteration: 1601; Percent complete: 40.0%; Average loss: 2.7080
Iteration: 1602; Percent complete: 40.1%; Average loss: 2.6406
Iteration: 1603; Percent complete: 40.1%; Average loss: 2.8068
Iteration: 1604; Percent complete: 40.1%; Average loss: 2.6520
Iteration: 1605; Percent complete: 40.1%; Average loss: 2.7827
Iteration: 1606; Percent complete: 40.2%; Average loss: 2.6833
Iteration: 1607; Percent complete: 40.2%; Average loss: 2.7671
Iteration: 1608; Percent complete: 40.2%; Average loss: 2.5759
Iteration: 1609; Percent complete: 40.2%; Average loss:

Iteration: 1725; Percent complete: 43.1%; Average loss: 2.9334
Iteration: 1726; Percent complete: 43.1%; Average loss: 2.8936
Iteration: 1727; Percent complete: 43.2%; Average loss: 2.6305
Iteration: 1728; Percent complete: 43.2%; Average loss: 2.7669
Iteration: 1729; Percent complete: 43.2%; Average loss: 2.5896
Iteration: 1730; Percent complete: 43.2%; Average loss: 2.7175
Iteration: 1731; Percent complete: 43.3%; Average loss: 2.6750
Iteration: 1732; Percent complete: 43.3%; Average loss: 2.7929
Iteration: 1733; Percent complete: 43.3%; Average loss: 2.8725
Iteration: 1734; Percent complete: 43.4%; Average loss: 3.0221
Iteration: 1735; Percent complete: 43.4%; Average loss: 2.6477
Iteration: 1736; Percent complete: 43.4%; Average loss: 2.8629
Iteration: 1737; Percent complete: 43.4%; Average loss: 2.8281
Iteration: 1738; Percent complete: 43.5%; Average loss: 2.5456
Iteration: 1739; Percent complete: 43.5%; Average loss: 2.5066
Iteration: 1740; Percent complete: 43.5%; Average loss:

Iteration: 1856; Percent complete: 46.4%; Average loss: 2.7892
Iteration: 1857; Percent complete: 46.4%; Average loss: 2.6575
Iteration: 1858; Percent complete: 46.5%; Average loss: 2.8373
Iteration: 1859; Percent complete: 46.5%; Average loss: 2.5285
Iteration: 1860; Percent complete: 46.5%; Average loss: 2.8067
Iteration: 1861; Percent complete: 46.5%; Average loss: 2.6797
Iteration: 1862; Percent complete: 46.6%; Average loss: 2.6248
Iteration: 1863; Percent complete: 46.6%; Average loss: 2.5902
Iteration: 1864; Percent complete: 46.6%; Average loss: 2.6812
Iteration: 1865; Percent complete: 46.6%; Average loss: 2.6534
Iteration: 1866; Percent complete: 46.7%; Average loss: 2.4988
Iteration: 1867; Percent complete: 46.7%; Average loss: 2.6707
Iteration: 1868; Percent complete: 46.7%; Average loss: 2.5070
Iteration: 1869; Percent complete: 46.7%; Average loss: 2.5832
Iteration: 1870; Percent complete: 46.8%; Average loss: 2.7826
Iteration: 1871; Percent complete: 46.8%; Average loss:

Iteration: 1987; Percent complete: 49.7%; Average loss: 2.7831
Iteration: 1988; Percent complete: 49.7%; Average loss: 2.4497
Iteration: 1989; Percent complete: 49.7%; Average loss: 2.7350
Iteration: 1990; Percent complete: 49.8%; Average loss: 2.4575
Iteration: 1991; Percent complete: 49.8%; Average loss: 2.7837
Iteration: 1992; Percent complete: 49.8%; Average loss: 2.4555
Iteration: 1993; Percent complete: 49.8%; Average loss: 2.5836
Iteration: 1994; Percent complete: 49.9%; Average loss: 2.9402
Iteration: 1995; Percent complete: 49.9%; Average loss: 2.7195
Iteration: 1996; Percent complete: 49.9%; Average loss: 2.6699
Iteration: 1997; Percent complete: 49.9%; Average loss: 2.9183
Iteration: 1998; Percent complete: 50.0%; Average loss: 2.9355
Iteration: 1999; Percent complete: 50.0%; Average loss: 2.8116
Iteration: 2000; Percent complete: 50.0%; Average loss: 2.7355
Iteration: 2001; Percent complete: 50.0%; Average loss: 2.4555
Iteration: 2002; Percent complete: 50.0%; Average loss:

Iteration: 2118; Percent complete: 52.9%; Average loss: 2.7923
Iteration: 2119; Percent complete: 53.0%; Average loss: 2.5294
Iteration: 2120; Percent complete: 53.0%; Average loss: 2.4976
Iteration: 2121; Percent complete: 53.0%; Average loss: 2.4873
Iteration: 2122; Percent complete: 53.0%; Average loss: 2.3496
Iteration: 2123; Percent complete: 53.1%; Average loss: 2.4361
Iteration: 2124; Percent complete: 53.1%; Average loss: 2.5315
Iteration: 2125; Percent complete: 53.1%; Average loss: 2.7106
Iteration: 2126; Percent complete: 53.1%; Average loss: 2.5786
Iteration: 2127; Percent complete: 53.2%; Average loss: 2.5304
Iteration: 2128; Percent complete: 53.2%; Average loss: 2.7666
Iteration: 2129; Percent complete: 53.2%; Average loss: 2.2902
Iteration: 2130; Percent complete: 53.2%; Average loss: 2.6998
Iteration: 2131; Percent complete: 53.3%; Average loss: 2.6914
Iteration: 2132; Percent complete: 53.3%; Average loss: 2.5875
Iteration: 2133; Percent complete: 53.3%; Average loss:

Iteration: 2249; Percent complete: 56.2%; Average loss: 2.5019
Iteration: 2250; Percent complete: 56.2%; Average loss: 2.5037
Iteration: 2251; Percent complete: 56.3%; Average loss: 2.6755
Iteration: 2252; Percent complete: 56.3%; Average loss: 2.5403
Iteration: 2253; Percent complete: 56.3%; Average loss: 2.5807
Iteration: 2254; Percent complete: 56.4%; Average loss: 2.6755
Iteration: 2255; Percent complete: 56.4%; Average loss: 2.6271
Iteration: 2256; Percent complete: 56.4%; Average loss: 2.7123
Iteration: 2257; Percent complete: 56.4%; Average loss: 2.4085
Iteration: 2258; Percent complete: 56.5%; Average loss: 2.4821
Iteration: 2259; Percent complete: 56.5%; Average loss: 2.8102
Iteration: 2260; Percent complete: 56.5%; Average loss: 2.6604
Iteration: 2261; Percent complete: 56.5%; Average loss: 2.8105
Iteration: 2262; Percent complete: 56.5%; Average loss: 2.6447
Iteration: 2263; Percent complete: 56.6%; Average loss: 2.5174
Iteration: 2264; Percent complete: 56.6%; Average loss:

Iteration: 2380; Percent complete: 59.5%; Average loss: 2.4707
Iteration: 2381; Percent complete: 59.5%; Average loss: 2.6207
Iteration: 2382; Percent complete: 59.6%; Average loss: 2.6110
Iteration: 2383; Percent complete: 59.6%; Average loss: 2.3816
Iteration: 2384; Percent complete: 59.6%; Average loss: 2.6907
Iteration: 2385; Percent complete: 59.6%; Average loss: 2.5856
Iteration: 2386; Percent complete: 59.7%; Average loss: 2.8312
Iteration: 2387; Percent complete: 59.7%; Average loss: 2.5940
Iteration: 2388; Percent complete: 59.7%; Average loss: 2.5705
Iteration: 2389; Percent complete: 59.7%; Average loss: 2.5950
Iteration: 2390; Percent complete: 59.8%; Average loss: 2.4095
Iteration: 2391; Percent complete: 59.8%; Average loss: 2.4765
Iteration: 2392; Percent complete: 59.8%; Average loss: 2.5005
Iteration: 2393; Percent complete: 59.8%; Average loss: 2.5456
Iteration: 2394; Percent complete: 59.9%; Average loss: 2.6368
Iteration: 2395; Percent complete: 59.9%; Average loss:

Iteration: 2511; Percent complete: 62.8%; Average loss: 2.4619
Iteration: 2512; Percent complete: 62.8%; Average loss: 2.2727
Iteration: 2513; Percent complete: 62.8%; Average loss: 2.6569
Iteration: 2514; Percent complete: 62.8%; Average loss: 2.4788
Iteration: 2515; Percent complete: 62.9%; Average loss: 2.5043
Iteration: 2516; Percent complete: 62.9%; Average loss: 2.4479
Iteration: 2517; Percent complete: 62.9%; Average loss: 2.4460
Iteration: 2518; Percent complete: 62.9%; Average loss: 2.4358
Iteration: 2519; Percent complete: 63.0%; Average loss: 2.3952
Iteration: 2520; Percent complete: 63.0%; Average loss: 2.6303
Iteration: 2521; Percent complete: 63.0%; Average loss: 2.6946
Iteration: 2522; Percent complete: 63.0%; Average loss: 2.5765
Iteration: 2523; Percent complete: 63.1%; Average loss: 2.4748
Iteration: 2524; Percent complete: 63.1%; Average loss: 2.5303
Iteration: 2525; Percent complete: 63.1%; Average loss: 2.4459
Iteration: 2526; Percent complete: 63.1%; Average loss:

Iteration: 2642; Percent complete: 66.0%; Average loss: 2.4039
Iteration: 2643; Percent complete: 66.1%; Average loss: 2.2503
Iteration: 2644; Percent complete: 66.1%; Average loss: 2.4835
Iteration: 2645; Percent complete: 66.1%; Average loss: 2.6283
Iteration: 2646; Percent complete: 66.1%; Average loss: 2.8477
Iteration: 2647; Percent complete: 66.2%; Average loss: 2.4019
Iteration: 2648; Percent complete: 66.2%; Average loss: 2.6307
Iteration: 2649; Percent complete: 66.2%; Average loss: 2.4572
Iteration: 2650; Percent complete: 66.2%; Average loss: 2.4453
Iteration: 2651; Percent complete: 66.3%; Average loss: 2.6015
Iteration: 2652; Percent complete: 66.3%; Average loss: 2.4181
Iteration: 2653; Percent complete: 66.3%; Average loss: 2.6944
Iteration: 2654; Percent complete: 66.3%; Average loss: 2.5948
Iteration: 2655; Percent complete: 66.4%; Average loss: 2.2818
Iteration: 2656; Percent complete: 66.4%; Average loss: 2.6923
Iteration: 2657; Percent complete: 66.4%; Average loss:

Iteration: 2773; Percent complete: 69.3%; Average loss: 2.2816
Iteration: 2774; Percent complete: 69.3%; Average loss: 2.3798
Iteration: 2775; Percent complete: 69.4%; Average loss: 2.3399
Iteration: 2776; Percent complete: 69.4%; Average loss: 2.4695
Iteration: 2777; Percent complete: 69.4%; Average loss: 2.5938
Iteration: 2778; Percent complete: 69.5%; Average loss: 2.4953
Iteration: 2779; Percent complete: 69.5%; Average loss: 2.4317
Iteration: 2780; Percent complete: 69.5%; Average loss: 2.5747
Iteration: 2781; Percent complete: 69.5%; Average loss: 2.6266
Iteration: 2782; Percent complete: 69.5%; Average loss: 2.7448
Iteration: 2783; Percent complete: 69.6%; Average loss: 2.2875
Iteration: 2784; Percent complete: 69.6%; Average loss: 2.3335
Iteration: 2785; Percent complete: 69.6%; Average loss: 2.3229
Iteration: 2786; Percent complete: 69.7%; Average loss: 2.2414
Iteration: 2787; Percent complete: 69.7%; Average loss: 2.5123
Iteration: 2788; Percent complete: 69.7%; Average loss:

Iteration: 2904; Percent complete: 72.6%; Average loss: 2.3137
Iteration: 2905; Percent complete: 72.6%; Average loss: 2.3456
Iteration: 2906; Percent complete: 72.7%; Average loss: 2.5041
Iteration: 2907; Percent complete: 72.7%; Average loss: 2.3367
Iteration: 2908; Percent complete: 72.7%; Average loss: 2.3483
Iteration: 2909; Percent complete: 72.7%; Average loss: 2.4224
Iteration: 2910; Percent complete: 72.8%; Average loss: 2.3775
Iteration: 2911; Percent complete: 72.8%; Average loss: 2.4316
Iteration: 2912; Percent complete: 72.8%; Average loss: 2.6058
Iteration: 2913; Percent complete: 72.8%; Average loss: 2.5854
Iteration: 2914; Percent complete: 72.9%; Average loss: 2.5289
Iteration: 2915; Percent complete: 72.9%; Average loss: 2.4455
Iteration: 2916; Percent complete: 72.9%; Average loss: 2.3146
Iteration: 2917; Percent complete: 72.9%; Average loss: 2.5584
Iteration: 2918; Percent complete: 73.0%; Average loss: 2.2854
Iteration: 2919; Percent complete: 73.0%; Average loss:

Iteration: 3035; Percent complete: 75.9%; Average loss: 2.4642
Iteration: 3036; Percent complete: 75.9%; Average loss: 2.5936
Iteration: 3037; Percent complete: 75.9%; Average loss: 2.4138
Iteration: 3038; Percent complete: 75.9%; Average loss: 2.4984
Iteration: 3039; Percent complete: 76.0%; Average loss: 2.3536
Iteration: 3040; Percent complete: 76.0%; Average loss: 2.5354
Iteration: 3041; Percent complete: 76.0%; Average loss: 2.5071
Iteration: 3042; Percent complete: 76.0%; Average loss: 2.5733
Iteration: 3043; Percent complete: 76.1%; Average loss: 2.2831
Iteration: 3044; Percent complete: 76.1%; Average loss: 2.2757
Iteration: 3045; Percent complete: 76.1%; Average loss: 2.4729
Iteration: 3046; Percent complete: 76.1%; Average loss: 2.2188
Iteration: 3047; Percent complete: 76.2%; Average loss: 2.3513
Iteration: 3048; Percent complete: 76.2%; Average loss: 2.2782
Iteration: 3049; Percent complete: 76.2%; Average loss: 2.3438
Iteration: 3050; Percent complete: 76.2%; Average loss:

Iteration: 3166; Percent complete: 79.1%; Average loss: 2.3131
Iteration: 3167; Percent complete: 79.2%; Average loss: 2.5408
Iteration: 3168; Percent complete: 79.2%; Average loss: 2.3669
Iteration: 3169; Percent complete: 79.2%; Average loss: 2.2686
Iteration: 3170; Percent complete: 79.2%; Average loss: 2.4143
Iteration: 3171; Percent complete: 79.3%; Average loss: 2.2752
Iteration: 3172; Percent complete: 79.3%; Average loss: 2.1899
Iteration: 3173; Percent complete: 79.3%; Average loss: 2.2410
Iteration: 3174; Percent complete: 79.3%; Average loss: 2.5028
Iteration: 3175; Percent complete: 79.4%; Average loss: 2.3569
Iteration: 3176; Percent complete: 79.4%; Average loss: 2.1899
Iteration: 3177; Percent complete: 79.4%; Average loss: 2.2570
Iteration: 3178; Percent complete: 79.5%; Average loss: 2.1301
Iteration: 3179; Percent complete: 79.5%; Average loss: 2.0539
Iteration: 3180; Percent complete: 79.5%; Average loss: 2.3113
Iteration: 3181; Percent complete: 79.5%; Average loss:

Iteration: 3297; Percent complete: 82.4%; Average loss: 2.5322
Iteration: 3298; Percent complete: 82.5%; Average loss: 2.1670
Iteration: 3299; Percent complete: 82.5%; Average loss: 2.2999
Iteration: 3300; Percent complete: 82.5%; Average loss: 2.4708
Iteration: 3301; Percent complete: 82.5%; Average loss: 2.0988
Iteration: 3302; Percent complete: 82.5%; Average loss: 2.1754
Iteration: 3303; Percent complete: 82.6%; Average loss: 2.3514
Iteration: 3304; Percent complete: 82.6%; Average loss: 2.2518
Iteration: 3305; Percent complete: 82.6%; Average loss: 2.6973
Iteration: 3306; Percent complete: 82.7%; Average loss: 2.4499
Iteration: 3307; Percent complete: 82.7%; Average loss: 2.2386
Iteration: 3308; Percent complete: 82.7%; Average loss: 2.3624
Iteration: 3309; Percent complete: 82.7%; Average loss: 2.1784
Iteration: 3310; Percent complete: 82.8%; Average loss: 2.2777
Iteration: 3311; Percent complete: 82.8%; Average loss: 2.2602
Iteration: 3312; Percent complete: 82.8%; Average loss:

Iteration: 3428; Percent complete: 85.7%; Average loss: 2.1102
Iteration: 3429; Percent complete: 85.7%; Average loss: 2.5112
Iteration: 3430; Percent complete: 85.8%; Average loss: 2.3674
Iteration: 3431; Percent complete: 85.8%; Average loss: 2.2398
Iteration: 3432; Percent complete: 85.8%; Average loss: 2.3746
Iteration: 3433; Percent complete: 85.8%; Average loss: 2.2394
Iteration: 3434; Percent complete: 85.9%; Average loss: 2.2918
Iteration: 3435; Percent complete: 85.9%; Average loss: 2.1464
Iteration: 3436; Percent complete: 85.9%; Average loss: 2.3801
Iteration: 3437; Percent complete: 85.9%; Average loss: 2.2720
Iteration: 3438; Percent complete: 86.0%; Average loss: 2.1030
Iteration: 3439; Percent complete: 86.0%; Average loss: 2.5685
Iteration: 3440; Percent complete: 86.0%; Average loss: 2.4026
Iteration: 3441; Percent complete: 86.0%; Average loss: 2.4991
Iteration: 3442; Percent complete: 86.1%; Average loss: 2.2831
Iteration: 3443; Percent complete: 86.1%; Average loss:

Iteration: 3559; Percent complete: 89.0%; Average loss: 2.2000
Iteration: 3560; Percent complete: 89.0%; Average loss: 2.2007
Iteration: 3561; Percent complete: 89.0%; Average loss: 2.0472
Iteration: 3562; Percent complete: 89.0%; Average loss: 2.3746
Iteration: 3563; Percent complete: 89.1%; Average loss: 2.2838
Iteration: 3564; Percent complete: 89.1%; Average loss: 2.3141
Iteration: 3565; Percent complete: 89.1%; Average loss: 2.1108
Iteration: 3566; Percent complete: 89.1%; Average loss: 2.2129
Iteration: 3567; Percent complete: 89.2%; Average loss: 2.2558
Iteration: 3568; Percent complete: 89.2%; Average loss: 2.2263
Iteration: 3569; Percent complete: 89.2%; Average loss: 2.4494
Iteration: 3570; Percent complete: 89.2%; Average loss: 2.2041
Iteration: 3571; Percent complete: 89.3%; Average loss: 2.4239
Iteration: 3572; Percent complete: 89.3%; Average loss: 2.1167
Iteration: 3573; Percent complete: 89.3%; Average loss: 2.1797
Iteration: 3574; Percent complete: 89.3%; Average loss:

Iteration: 3690; Percent complete: 92.2%; Average loss: 2.1770
Iteration: 3691; Percent complete: 92.3%; Average loss: 2.2312
Iteration: 3692; Percent complete: 92.3%; Average loss: 2.3978
Iteration: 3693; Percent complete: 92.3%; Average loss: 2.1374
Iteration: 3694; Percent complete: 92.3%; Average loss: 2.0962
Iteration: 3695; Percent complete: 92.4%; Average loss: 2.3865
Iteration: 3696; Percent complete: 92.4%; Average loss: 2.1399
Iteration: 3697; Percent complete: 92.4%; Average loss: 2.1973
Iteration: 3698; Percent complete: 92.5%; Average loss: 2.4100
Iteration: 3699; Percent complete: 92.5%; Average loss: 2.2106
Iteration: 3700; Percent complete: 92.5%; Average loss: 2.0154
Iteration: 3701; Percent complete: 92.5%; Average loss: 2.0418
Iteration: 3702; Percent complete: 92.5%; Average loss: 2.1145
Iteration: 3703; Percent complete: 92.6%; Average loss: 2.2264
Iteration: 3704; Percent complete: 92.6%; Average loss: 2.0730
Iteration: 3705; Percent complete: 92.6%; Average loss:

Iteration: 3821; Percent complete: 95.5%; Average loss: 2.1650
Iteration: 3822; Percent complete: 95.5%; Average loss: 2.1178
Iteration: 3823; Percent complete: 95.6%; Average loss: 2.2274
Iteration: 3824; Percent complete: 95.6%; Average loss: 2.0697
Iteration: 3825; Percent complete: 95.6%; Average loss: 2.0800
Iteration: 3826; Percent complete: 95.7%; Average loss: 2.2927
Iteration: 3827; Percent complete: 95.7%; Average loss: 2.2988
Iteration: 3828; Percent complete: 95.7%; Average loss: 2.2492
Iteration: 3829; Percent complete: 95.7%; Average loss: 2.3555
Iteration: 3830; Percent complete: 95.8%; Average loss: 2.2244
Iteration: 3831; Percent complete: 95.8%; Average loss: 2.2224
Iteration: 3832; Percent complete: 95.8%; Average loss: 2.2153
Iteration: 3833; Percent complete: 95.8%; Average loss: 2.1055
Iteration: 3834; Percent complete: 95.9%; Average loss: 2.1703
Iteration: 3835; Percent complete: 95.9%; Average loss: 2.2045
Iteration: 3836; Percent complete: 95.9%; Average loss:

Iteration: 3952; Percent complete: 98.8%; Average loss: 2.2885
Iteration: 3953; Percent complete: 98.8%; Average loss: 2.0521
Iteration: 3954; Percent complete: 98.9%; Average loss: 2.2620
Iteration: 3955; Percent complete: 98.9%; Average loss: 1.9946
Iteration: 3956; Percent complete: 98.9%; Average loss: 2.0985
Iteration: 3957; Percent complete: 98.9%; Average loss: 2.2093
Iteration: 3958; Percent complete: 99.0%; Average loss: 2.1317
Iteration: 3959; Percent complete: 99.0%; Average loss: 2.2409
Iteration: 3960; Percent complete: 99.0%; Average loss: 2.3294
Iteration: 3961; Percent complete: 99.0%; Average loss: 2.0972
Iteration: 3962; Percent complete: 99.1%; Average loss: 2.1715
Iteration: 3963; Percent complete: 99.1%; Average loss: 2.1289
Iteration: 3964; Percent complete: 99.1%; Average loss: 2.2907
Iteration: 3965; Percent complete: 99.1%; Average loss: 2.1187
Iteration: 3966; Percent complete: 99.2%; Average loss: 2.0166
Iteration: 3967; Percent complete: 99.2%; Average loss:

## To chat with your model

In [None]:
# Set dropout layers to eval mode
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder, decoder)

# Begin chatting (uncomment and run the following line to begin)
evaluateInput(encoder, decoder, searcher, voc)

> Hello
Bot: hello .
> How are you?
Bot: i m fine .
> Great. What are you doing?
Bot: i m getting a little .
> I understand. What is your name?
Bot: dr . crowe .
> Brillant. Which is your job?
Error: Encountered unknown word.
> It is very difficult task, isn't it?
Error: Encountered unknown word.
> Yeah. I already understood.
Bot: what do you want ?
> Know you better. Can I?
Bot: yeah .
