## ChatBot Tutorial 

from https://pytorch.org/tutorials/beginner/chatbot_tutorial.html

Using Recurrent seq-toseq modeling train a simple chatbot on Cornell Movie Dialogue Corpus

Objective : Learn seq to seq modeling , RNN ,training encoder-decoder together, get more practice on text data and NLP tasks

Future: Use the same model on some different dataset

### StartupTasks

In [1]:
!pip3 install torch torchvision

Collecting torch
[?25l  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
[K    100% |████████████████████████████████| 519.5MB 27kB/s 
tcmalloc: large alloc 1073750016 bytes == 0x5856c000 @  0x7f500738a2a4 0x594e17 0x626104 0x51190a 0x4f5277 0x510c78 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f6070 0x510c78 0x5119bd 0x4f5277 0x4f3338 0x510fb0 0x5119bd 0x4f6070 0x4f3338 0x510fb0 0x5119bd 0x4f6070
[?25hCollecting torchvision
[?25l  Downloading https://files.pythonhosted.org/packages/ca/0d/f00b2885711e08bd71242ebe7b96561e6f6d01fdb4b9dcf4d37e2e13c5e1/torchvision-0.2.1-py2.py3-none-any.whl (54kB)
[K    100% |████████████████████████████████| 61kB 19.6MB/s 
[?25hCollecting pillow>=4.1.1 (from torchvision)
[?25l  Downloading https://files.pythonhosted.org/packages/62/94/5430ebaa83f91cc7a

In [2]:
!wget http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip

--2018-10-22 11:53:52--  http://www.cs.cornell.edu/~cristian/data/cornell_movie_dialogs_corpus.zip
Resolving www.cs.cornell.edu (www.cs.cornell.edu)... 132.236.207.20
Connecting to www.cs.cornell.edu (www.cs.cornell.edu)|132.236.207.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9916637 (9.5M) [application/zip]
Saving to: ‘cornell_movie_dialogs_corpus.zip’


2018-10-22 11:53:53 (5.91 MB/s) - ‘cornell_movie_dialogs_corpus.zip’ saved [9916637/9916637]



In [3]:
!mkdir data
!mv /content/cornell_movie_dialogs_corpus.zip /content/data/
!unzip -q /content/data/cornell_movie_dialogs_corpus.zip
!ls /content/data/

cornell_movie_dialogs_corpus.zip


In [4]:
!ls /content/'cornell movie-dialogs corpus'

chameleons.pdf		       movie_lines.txt		  README.txt
movie_characters_metadata.txt  movie_titles_metadata.txt
movie_conversations.txt        raw_script_urls.txt


In [0]:
import torch
from torch.jit import script,trace
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import csv
import random
import re
import os
import unicodedata
import codecs
from io import open
import itertools
import math

In [0]:
USE_CUDA = torch.cuda.is_available()
device = torch.device("cuda" if USE_CUDA else "cpu")

### Loading and Preprocessing data

~ 250 lines of preprocessing code

In [7]:
# Looking at the data 

corpus_name = "cornell movie-dialogs corpus"
corpus = os.path.join(corpus_name)

def printlines(file,n=10):
  with open(file,"rb") as datafile:
    lines = datafile.readlines()
  
  for line in lines[:n]:
    print(line)
    
printlines(os.path.join(corpus,"movie_lines.txt"))

b'L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
b'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!\n'
b'L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.\n'
b'L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?\n'
b"L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.\n"
b'L924 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Wow\n'
b"L872 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Okay -- you're gonna need to learn how to lie.\n"
b'L871 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ No\n'
b'L870 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I\'m kidding.  You know how sometimes you just become this "persona"?  And you don\'t know how to quit?\n'
b'L869 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Like my fear of wearing pastels?\n'


**Create Formatted Data File** 

Tab separated query and response sentences

loadLines - split each line of the file into a dictionary of fields (lineID,characterId,movieId,character,text)

loadConversations - group fields of lines from loadLines into conversations based on movie_conversation.txt

extractSentencePairs extracts pars of sentences from conversations 

In [0]:
# split each line of the file into a dictionary of fields
def loadLines(filename,fields):
  lines ={}
  with open(filename,'r',encoding='iso-8859-1') as f:
    for line in f:
      values = line.split("+++$+++")
      # extracted fields
      line_obj={}
      for i,field in enumerate(fields):
        # object
        line_obj[field]=values[i]
      # line id key for nested line object 
      # stripping as lineID is coming with a space in end 'L194 '
      lines[line_obj['lineID'].strip()]=line_obj
  return lines

# group fields of lines from loadlines into conversations based on movie_conversations
def loadConversations(filename,lines,fields):
  conversations=[]
  with open(filename,'r',encoding='iso-8859-1') as f:
    for line in f:
      values = line.split("+++$+++")
      conv_obj={}
      for i,field in enumerate(fields):
        conv_obj[field] = values[i]
      #Convert string to a list conv_obj["utteranceID"] =="[L598485,L...]"
      lineIds = eval(conv_obj["utteranceID"])
      # reassemble lines
      conv_obj["lines"]=[]
      for lineId in lineIds:
        conv_obj["lines"].append(lines[lineId])
      conversations.append(conv_obj)
  return conversations

# Extract pair of sentences from conversations
def extractSentencePairs(conversations):
  qa_pairs=[]
  for conversation in conversations:
    # Iterate over all lines 
    for i in range(len(conversation["lines"])-1):
      inputLine = conversation["lines"][i]["text"].strip()
      targetLine = conversation["lines"][i+1]["text"].strip()
      # add to qa pair if both exist
      if inputLine and targetLine:
        qa_pairs.append([inputLine,targetLine])
  return qa_pairs

In [9]:
# using the functions to generate conversation and create a new file - datafile

datafile = os.path.join(corpus,"formatted_movie_lines.txt")

delimiter ='\t'
# unescaping the delimiter
delimiter = str(codecs.decode(delimiter,"unicode_escape"))


#Initialize the lines dict , conversation dict and fields for lines and conversation
lines={}
conversations=[]
MOVIE_LINES_FIELDS = ["lineID","characterID","movieID","character","text"]
MOVIE_CONVERSATION_FIELDS = ["characterID","character2ID","movieID","utteranceID"]

# Load lines and preproces conversations 
print("Preprocessing, loading lines")
lines = loadLines(os.path.join(corpus,"movie_lines.txt"),MOVIE_LINES_FIELDS)
print("Loading Conversations")
conversations = loadConversations(os.path.join(corpus,"movie_conversations.txt"),lines,MOVIE_CONVERSATION_FIELDS)

# writing a new file
print("writing formatted file")
with open(datafile,'w',encoding='utf-8') as outfile:
  writer = csv.writer(outfile,delimiter=delimiter)
  for pair in extractSentencePairs(conversations):
    writer.writerow(pair)



Preprocessing, loading lines
Loading Conversations
writing formatted file


In [10]:
#printing sample data from formatted file 
printlines(datafile)

b"Can we make this quick?  Roxanne Korrine and Andrew Barrett are having an incredibly horrendous public break- up on the quad.  Again.\tWell, I thought we'd start with pronunciation, if that's okay with you.\r\n"
b"Well, I thought we'd start with pronunciation, if that's okay with you.\tNot the hacking and gagging and spitting part.  Please.\r\n"
b"Not the hacking and gagging and spitting part.  Please.\tOkay... then how 'bout we try out some French cuisine.  Saturday?  Night?\r\n"
b"You're asking me out.  That's so cute. What's your name again?\tForget it.\r\n"
b"No, no, it's my fault -- we didn't have a proper introduction ---\tCameron.\r\n"
b"Cameron.\tThe thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\r\n"
b"The thing is, Cameron -- I'm at the mercy of a particularly hideous breed of loser.  My sister.  I can't date until she does.\tSeems like she could get a date easy enough...\r\n"
b'Why?\tUnsolved myster

**Load and Trim data **

Creating a vocabulary and loading query/response pairs


In [0]:
# default word tokens
PAD_token = 0 # for padding short sentences
SOS_token = 1 # start of sentence token
EOS_token = 2 # end of sentence token 

class Voc:
  def __init__(self,name):
    self.name = name
    self.trimmed = False
    self.word2index = {}
    self.word2count = {}
    self.index2word = {PAD_token:"PAD",SOS_token:"SOS",EOS_token:"EOS"}
    self.num_words = 3 # default 3 
  
  
  def addSentence(self,sentence):
    for word in sentence.split(' '):
      self.addWord(word)
 
  def addWord(self,word):
    if word not in self.word2index:
      self.word2index[word] = self.num_words
      self.word2count[word] = 1
      self.index2word[self.num_words] = word
      self.num_words += 1
    else:
      self.word2count[word] +=1
 #remove words below a certain threshold
  def trim(self,min_count):
      if self.trimmed:
        return
      self.trimmed = True
      keep_words = []
      for k,v in self.word2count.items():
        if v>=min_count:
          keep_words.append(k)
    
      print(f'Keep words {len(keep_words)},{len(self.word2index)},{len(keep_words)/len(self.word2index)}')
    
      # reinitialize dictionaries
      self.word2index = {}
      self.word2count = {}
      self.index2words = {PAD_token:"PAD",SOS_token:"SOS",EOS_token:"EOS"}
      self.num_words = 3 # default 3 tokens
    
      for word in keep_words:
        self.addWord(word)

 Assemble vocabulary and query/response pairs ,
 
 But need to convert Unicode strings to ASCII using unicodeToAscii 
 
Then convert all characters to lowercase and trim non-letter characters - normalizeString

Finally filter out sentences with length greater than MAX_LENGTH - filterPairs

In [12]:
MAX_LENGTH = 10 # max sentence length to consider

# turning unicode to Ascii
def unicodeToAscii(s):
  #"Mn" stands for Nonspacing_Mark
  return ''.join(
    c for c in unicodedata.normalize('NFD',s) if unicodedata.category(c)!='Mn')

# lowercase and rim and remove non-letter character
def normalizeString(s):
  s = unicodeToAscii(s.lower().strip())
  s = re.sub(r"([.!?])",r" \1",s)
  s = re.sub(r"[^a-zA-Z.!?]+",r" ",s)
  s = re.sub(r"\s+",r" ",s).strip()
  return s

# read a query response pair and return a vocab object
def readVocs(datafile,corpus_name):
  print("reading lines")
  # read file and split into lines
  lines = open(datafile,encoding='utf-8').read().strip().split('\n')
  # split every line into pairs and normalize
  pairs =[[normalizeString(s) for s in l.split('\t')] for l in lines]
  voc = Voc(corpus_name)
  return voc,pairs

# Return true if both sentences in pair are under max length
def filterPair(p):
  return len(p[0].split(' ')) < MAX_LENGTH and len(p[1].split(' ')) < MAX_LENGTH

# check which pairs are valid
def filterPairs(pairs):
  return [pair for pair in pairs if filterPair(pair)]

# Using all of the above method return a populated voc object and pairs list
def loadPrepareData(corpus,corpus_name,datafile,save_dir):
  print('Start preparing training data')
  voc,pairs = readVocs(datafile,corpus_name)
  print(f"Read {len(pairs)} sentence pairs")
  pairs = filterPairs(pairs)
  print(f"Trimmed to {len(pairs)} sentence pairs")
  for pair in pairs:
    voc.addSentence(pair[0])
    voc.addSentence(pair[1])
  print("Counted Words :",voc.num_words)
  return voc,pairs

# Load/Assemble voc and pairs
save_dir = os.path.join("data","save")
voc,pairs = loadPrepareData(corpus,corpus_name,datafile,save_dir)


Start preparing training data
reading lines
Read 221282 sentence pairs
Trimmed to 64271 sentence pairs
Counted Words : 18008


In [13]:
# checking some pairs
for pair in pairs[:10]:
  print(pair)

['there .', 'where ?']
['you have my word . as a gentleman', 'you re sweet .']
['hi .', 'looks like things worked out tonight huh ?']
['you know chastity ?', 'i believe we share an art instructor']
['have fun tonight ?', 'tons']
['well no . . .', 'then that s all you had to say .']
['then that s all you had to say .', 'but']
['but', 'you always been this selfish ?']
['do you listen to this crap ?', 'what crap ?']
['what good stuff ?', 'the real you .']


Trimming rarely used words for faster convergence during training 

Trim words used under MIN_COUNT threshold using voc.trim

Filter out pairs with trimmed words

In [0]:
MIN_COUNT = 3

def trimRareWords(voc,pairs,MIN_COUNT):
  # trim words used under the MIN_COUNT from voc
  voc.trim(MIN_COUNT)
  # filter out pairs with trimmed words
  keep_pairs=[]
  for pair in pairs:
    input_sentence = pair[0]
    output_sentence = pair[1]
    keep_input = True
    keep_output = True
    # check input sentence 
    for word in input_sentence.split(' '):
      if word not in voc.word2index:
        keep_input = False
        break
    # check output sentence 
    for word in output_sentence.split(' '):
      if word not in voc.word2index:
        keep_output = False
        break
    # keep pair if both input and output contain words greater than min frequency
    if keep_input and keep_output:
      keep_pairs.append(pair)
  print(f"Trimmed from {len(pairs)} to {len(keep_pairs)} , {len(keep_pairs)/len(pairs)} of total")
  return keep_pairs    


In [15]:
# Trim voc and pairs
pairs = trimRareWords(voc,pairs,MIN_COUNT)

Keep words 7823,18005,0.43449041932796445
Trimmed from 64271 to 53165 , 0.8272004481025657 of total


**Prepare Data for Model **

Need to convert words into numerical torch tensors . 

Using mini-batch allows for faster training . But will have to accomodate sentences of different sizes in the same batch. Thus will make the batched input tensor of shape ( max_length , batch_size). Sentences shorter than max_length are zero padded after EOS_token


Simply converting sentences to their indexes ( indexesFromSentences ) and zero pad , the resulting tensor will have shape (batch_size,max_length)  and indexing the first dimension will return a full sequence across all time steps. - 

But we need to index the batch across time and across all sequences in the batch , Thus will have to take transpose of the input batch to ( max_length, batch_size) . This ensures that indexing along first dimension retursn a time step across all sentences in the batch

Convert sentences to tensor and return a tensor of lengths for all sequences in batch ( to be passed to decoder ) - inputVar

outputVar - similar to inputVar but instead of returning a lengths tensor , return a binary mask tensor and max target sentence length . Binary mask tensor has the same shape as the output target tensor but every element that is PAD_token is 0 and rest are 1

batch2TrainData - take a bunch of  pairs and return the inpit and target tensor 


In [0]:
def indexesFromSentence(voc,sentence):
  return [voc.word2index[word] for word in sentence.split(' ')] + [EOS_token]

def zeroPadding(l, fillvalue=PAD_token):
  return list(itertools.zip_longest( *l,fillvalue=fillvalue))

def binaryMatrix(l, value=PAD_token):
  m =[]
  for i, seq in enumerate(l):
    m.append([])
    for token in seq:
      if token == PAD_token:
        m[i].append(0)
      else:
        m[i].append(1)
  return m


# returns padded input sequence tensor amd lengths
def inputVar(l,voc):
  indexes_batch =  [indexesFromSentence(voc,sentence) for sentence in l]
  lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
  padList = zeroPadding(indexes_batch)
  padVar = torch.LongTensor(padList)
  return padVar,lengths


# return padded target sequence tensor, padding mask and max target length
def outputVar(l,voc):
  indexes_batch = [indexesFromSentence(voc,sentence) for sentence in l]
  max_target_len = max([len(indexes) for indexes in  indexes_batch])
  padList = zeroPadding(indexes_batch)
  mask = binaryMatrix(padList)
  mask = torch.ByteTensor(mask)
  padVar = torch.LongTensor(padList)
  return padVar,mask,max_target_len

# return all items for a given batch of pairs
def batch2TrainData(voc,pair_batch):
  pair_batch.sort(key=lambda x:len(x[0].split(" ")),reverse=True )
  input_batch, output_batch =[],[]
  
  for pair in pair_batch:
    input_batch.append(pair[0])
    output_batch.append(pair[1])
  inp,lengths = inputVar(input_batch,voc)
  output,mask,max_target_len =  outputVar(output_batch,voc)
  return inp,lengths,output,mask,max_target_len



In [17]:
# Validating
small_batch_size =5
batches = batch2TrainData(voc,[random.choice(pairs) for _ in range(small_batch_size)])
input_variable,lengths,target_variable,mask, max_target_len = batches

print("input variables ",input_variable)
print("lengths ",lengths)
print("target variable ", target_variable)
print("mask ",mask)
print("max target len",max_target_len)





input variables  tensor([[ 167,   38,   65,   25,   16],
        [ 101,  266,  331,  197, 1000],
        [  37,   59,  117,  117,    4],
        [3597,   76,  401,    4,    2],
        [ 230,   60, 1735,    2,    0],
        [  52,    4,    4,    0,    0],
        [  18,    2,    2,    0,    0],
        [  36,    0,    0,    0,    0],
        [   4,    0,    0,    0,    0],
        [   2,    0,    0,    0,    0]])
lengths  tensor([10,  7,  7,  5,  4])
target variable  tensor([[ 290,    4,   50,  147,   16],
        [  27,    4,   47,   68, 3256],
        [ 213,    4,    7,    7,    4],
        [  12,  477,  260,  259,  147],
        [3590,    4,   65,  174,   92],
        [   6,    2,  331,    6,    7],
        [   2,    0,  117,    2,    6],
        [   0,    0,  401,    0,    2],
        [   0,    0,    6,    0,    0],
        [   0,    0,    2,    0,    0]])
mask  tensor([[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1,

### Define Models


The main model is a seq-to-seq model, which takes a variable length sequence as an input and return a variable length sequence as output  using a fixed size model

Two separate RNN's  running together . One acts as a **encoder** , encoding a variable length input sequence to a fixed length context vector . This context vector
( final hidden layer of the RNN) will contain semantic info about the query sentence that is input.

The second RNN acts as a **decoder** which takes an input word and context vector and returns a guess for the next word in the sequence and a hidden state to use in next iteration


** Encoder **


Encoder RNN iterates through the input sentence one token at a time and outputs a hidden state vector and output vector at each time step .

Hidden state is passed to the next time step while ouput vector is recorded . Transforms the context into a set of points in high dimensional space , which decoder uses for  generating output

**GRU**

Using a bidirectional Gated Recurrent Unit (GRU) ; meaning essentially there are 2 independent RNN's . One fed the input sequence in normal sequential order and other in opposite order . The outputs of both are summed at each time step ensuring past and future context

Using embedding layer to encode word indexes into arbitrary feature space. Here it will map each word to a feature space of size hidden_size 

* If passing a padded batch of sequence into an RNN module , we must pack and unpack padding around the RNN pass using **torch.nn.utils.rnn.pack_padded_sequence** and **torch.nn.utils.rnn.pad_packed_sequence** respectively

Computation Graph 


1.   Convert word indexes to embeddings
2.   Pack padded batch of sequence for RNN module
3.   Forward pass through GRU
4.   Unpack Padding
5.   Sum bidirectional GRU outputs
6.  Return output and final hidden state


Inputs :
* input_seq: batch of input sentences;shape=(max_length,batch_size)
* input_lengths :list of sentence lengths corresponding to each sentence in the batch; shape=(batch_size)
* hidden:hidden state shape=(n_layers x num_directions, batch_size, hidden_size)

Outputs:
* outputs : output features of the last hidden layer of the GRU ( sum of bidirectional outputs);  shape=(max_length,batch_size,hidden_size)
* hidden : updated hidden state from GRU; shape = (n_layers x num_directionsm batch_size, hidden_size)



In [0]:
class EncoderRNN(nn.Module):
  def __init__(self,hidden_state,embedding,n_layers=1,dropout=0):
    super(EncoderRNN,self).__init__()
    self.n_layers = n_layers
    self.hidden_size = hidden_size
    self.embedding = embedding
    
    # initalize GRU - the input size and hidden size params are both set to hidden params
    # as the input size is a word embedding with no of features == hidden_size
    self.gru = nn.GRU(hidden_size,hidden_size,n_layers,
                      dropout=(0 if n_layers==1 else dropout),bidirectional=True)
    
  def forward(self,input_seq,input_lengths,hidden=None):
    # convert word indexes to embeddings
    embedded = self.embedding(input_seq)
    # pack padded batch of sequences for RNN module
    packed =  torch.nn.utils.rnn.pack_padded_sequence(embedded,input_lengths)
    # forward pass through GRU
    outputs, hidden = self.gru(packed,hidden)
    # unpack padding
    outputs,_ = torch.nn.utils.rnn.pad_packed_sequence(outputs)
    # sum bidirectional GRU outputs
    outputs = outputs[:,:,:self.hidden_size]+outputs[:,:,self.hidden_size:]
    # return output and final hidden state
    return outputs , hidden

** Decoder **

A decoder RNN generates sentence in  a tken-by-token fashion using encoder's context vectors and internal hidden states to generate the next word in the sequence. 

It generates tokens until it reaches an EOS_token.

Problem with vanilla seq-to-seq decoder - information loss especially with long input sequences , limiting the capability of the decoder 

** Attention Mechanism** allows decoder to look at certain parts of input rather than the entire fixed context at every step.

** Attention **  is calculated using decoder's current hidden state and the encoder's outputs.
The output attention weights have same shape as the input sequence allowing them to be multiplied by the encoder's outputs giving a weighted sum which indicates the part of the encoder output to pay attention to.

** Global Attention ** consider all of the encoder's hidden states ( instead of the hidden state of current time step) 

Also calculate attention  weights using the hidden state of the decoder from the current time step only ( local requires knowledge from previous time step)

Score functions 

Implement attention layer as a separate nn.Module called Attn - o/p is a softmax normalized weights tensor of shape ( batch_size,1,max_length)

In [0]:
class Attn(nn.Module):
  def __init__(self,method,hidden_size):
    super(Attn,self).__init__()
    self.method = method
    if self.method not in ['dot','general','concat']:
      raise ValueError(self.method," is not an appropriate attention method")
    self.hidden_size = hidden_size
    if self.method == 'general':
      self.attn = torch.nn.Linear(self.hidden_size,hidden_size)
    elif self.method == 'concat':
      self.attn = torch.nn.Linear(self.hidden_size*2,hidden_size)
      self.v = torch.nn.Parameter(torch.FloatTensor(hidden_size))
  
  def dot_score(self,hidden,encoder_output):
    return torch.sum(hidden * encoder_output,dim=2)
  
  def general_score(self,hidden,encoder_output):
    energy = self.attn(encoder_output)
    return torch.sum(hidden*energy,dim=2)
  
  def concat_score(self,hidden,encoder_output):
    energy = self.attn(torch.cat(
        (hidden.expand(encoder_output.size(0),-1,-1),encoder_output),2)).tanh()
    return torch.sum(self.v*energy,dim=2)
  
  def forward(self,hidden,encoder_outputs):
    # calculate the attention weights according to method
    if self.method == 'general':
      self.energies = self.general_score(hidden,encoder_outputs)
    elif self.method == 'concat':
      self.energies = self.concat_score(hidden,encoder_outputs)
    elif self.method == 'dot':
      attn_energies = self.dot_score(hidden,encoder_outputs)
      
      
    # transpose max length and batch size dimensions 
    attn_energies = attn_energies.t()
    
    # return the softmax normalied probabiity scores (with added dimension)
    return F.softmax(attn_energies,dim=1).unsqueeze(1)

Implementing the actual decoder  model  . Manually feed our batch one time step at a time . Thus embedded word tensor and GRU output will both have shape(1,batch_size,hidden_size)

Computation Graph :


1.   Get embedding of current input word
2.   Forward through unidirectional GRU
3.   Calculate attention weights from current GRU output from (2)
4.   Multiply attention weights to encoder outputs to get new weighted sum context vector
5.   Concatenate weighted context vector and GRU output 
6.   Predict next word using Luong eg ( without softmax)
7.   Return output and final hidden state


Inputs:

* input_step :one tie step  (one word) of input sequence batch ; shape(1,batch_size)
* last_hidden : final hidden layer of GRU ; shape (n_layers x num_directions , batch_size,hidden_size )
* encoder_outputs : encoder model's output; shape ( max_length,batch_size,hidden_size)

Outputs:

*  output : softmax normalised tensor giving probabilities to each word being the correct next word in the decoded sequence ; shape(batc_size,voc.num_words)
* hidden : final hidden state of GRU; shape (n_layers x num_directions ,batch_size , hidden_size)


In [0]:
class LuongAttnDecoderRNN(nn.Module):
  def __init__(self,attn_model,embedding,hidden_size,output_size,n_layers=1,dropout=0.1):
    super(LuongAttnDecoderRNN,self).__init__()
    # keep for reference 
    self.attn_model = attn_model
    self.hidden_size = hidden_size
    self.output_size = output_size
    self.n_layers = n_layers
    self.dropout = dropout
    
    # define layers 
    self.embedding = embedding
    self.embedding_dropout = nn.Dropout(dropout)
    self.gru = nn.GRU(hidden_size,hidden_size,n_layers,dropout=(0 if n_layers==1 else dropout))
    self.concat = nn.Linear(hidden_size*2,hidden_size)
    self.out = nn.Linear(hidden_size,output_size)
    self.attn = Attn(attn_model,hidden_size)
 
  def forward(self,input_step,last_hidden,encoder_outputs):
    # Run thhs one step(word) at a time 
    # Get embedding of current input word
    embedded = self.embedding(input_step)
    emebdded = self.embedding_dropout(embedded)
    # Forward through unidirectional GRU
    rnn_output,hidden = self.gru(embedded,last_hidden)
    # calculate attention weights from current GRU output
    attn_weights = self.attn(rnn_output,encoder_outputs)
    # Multiply attention weights to encoder outputs to get new "weighted sum" context vector
    context = attn_weights.bmm(encoder_outputs.transpose(0,1))
    # Concatenat eweighted context vector and GRU output using Luong eq
    rnn_output = rnn_output.squeeze(0)
    context = context.squeeze(1)
    concat_input = torch.cat((rnn_output,context),1)
    concat_output = torch.tanh(self.concat(concat_input))
    # Predict next word using Luong
    output = self.out(concat_output)
    output = F.softmax(output,dim=1)
    # Return output and final hidden state
    return output,hidden

### Defining Training Procedure

Masked Loss : Since dealing with batches of padded sequences , cannot consider all elements of tensor when calculating loss . Define "maskNLLLoss" to calculate loss based on decoder's output tensor , the target tensor and a binary mask tensor describing the padding of the target tensor. This loss function calculates the average negetive log liklihood of elements that correspond to a 1 in the mask tensor

In [0]:
def maskNLLLoss(inp,target,mask):
  nTotal = mask.sum()
  crossEntropy = -torch.log(torch.gather(inp,1,target.view(-1,1)))
  loss = crossEntropy.masked_select(mask).mean()
  loss = loss.to(device)
  return loss,nTotal.item()

** Single Iteration  Training**

train - algo for single training iteration ( single batch)

Tricks to aid in convergence:
 
 * **Teacher Forcing** - at some probability , set by **teacher_forcing_ratio**, use current target word as decoder's next input rather than using decoder's current guess . Aids in more efficient training , can cause problems during inference 
 
 * **Gradient Clipping ** - Counter the exploading gradient problem.
 
 
 Sequence Of Operations :
 
1. Forward pass entire input batch through encoder   
2. Initialize decoder inputs as SOS_token and hidden state as the encoder's final hidden state
3. Forward input batch sequence through decoder one time step at a time 
4. If teacher forcing: set next ecoder input as the current targer ; else:set next excoder input as the current decoder output
5. Calculate and accumulate loss
6. Perform backprop
7. Clip Gradients
8. Update encoder and decoder model parameters


* Pytorch's RNN models can be used by passing entire input seq or one time step at a time


In [0]:
def train(input_variable,lengths,target_variable,mask,max_target_len,encoder,decoder,
          embedding,encoder_optimizer,decoder_optimizer,batch_size,clip,max_length=MAX_LENGTH):
  # Zero Gradients 
  encoder_optimizer.zero_grad()
  decoder_optimizer.zero_grad()
  
  # set device options 
  input_variable = input_variable.to(device)
  lengths = lengths.to(device)
  target_variable = target_variable.to(device)
  mask = mask.to(device)
  
  # initialize variables
  loss = 0
  print_losses =[]
  n_totals=0
  
  # Forward Pass through encoder
  encoder_outputs , encoder_hidden = encoder(input_variable,lengths)
  
  # Create initial deccoder input (start with SOS_token for each sentence)
  decoder_input = torch.LongTensor([[SOS_token for _ in range(batch_size)]])
  decoder_input = decoder_input.to(device)
  
  # set initial decoder hidden state to the encoder's final hidden state
  decoder_hidden = encoder_hidden[:decoder.n_layers]
  
  # Determine if teacher forcing is used in this iteration
  use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False
  
  # Forward batch of sequences through decoder one time step at a time
  if use_teacher_forcing:
    for t in range(max_target_len):
      decoder_output,decoder_hidden = decoder(decoder_input,decoder_hidden,encoder_outputs)
      # Teacher forcing is on thus next input is current target
      decoder_input = target_variable[t].view(1,-1)
      # calculate and accumulate loss
      mask_loss,nTotal = maskNLLLoss(decoder_output,target_variable[t],mask[t])
      loss +=mask_loss
      print_losses.append(mask_loss.item()*nTotal)
      n_totals += nTotal
  else:
    for t in range(max_target_len):
      decoder_output,decoder_hidden = decoder(decoder_input,decoder_hidden,encoder_outputs)
      # No teacher forcing : next input is decoder's own current output
      _, topi = decoder_output.topk(1)
      decoder_input = torch.LongTensor([[topi[i][0] for i in range(batch_size)]])
      decoder_input = decoder_input.to(device)
      # Calculate and accumulate loss
      mask_loss,nTotal = maskNLLLoss(decoder_output,target_variable[t],mask[t])
      loss +=mask_loss
      print_losses.append(mask_loss.item()*nTotal)
      n_totals += nTotal
  
  # Perform backprop
  loss.backward()
  
  # Clip gradients: _ modified in place 
  _ = torch.nn.utils.clip_grad_norm_(encoder.parameters(),clip)
  _ = torch.nn.utils.clip_grad_norm_(decoder.parameters(),clip)
  
  # Adjust model weights
  encoder_optimizer.step()
  decoder_optimizer.step()
  
  return sum(print_losses)/n_totals

** Training Iterations **

trainingIters -  running n_iterations given models , data , optimizers  

Saving model weights - encoder and decoder state_dict (parameters ) , optimizer's state_dicts , the loss , the iteration etc

In [0]:
def trainIters(model_name,voc,pairs,encoder,decoder,encoder_optimizer,decoder_optimizer,
               embedding,encoder_n_layers,decoder_n_layers,save_dir,n_iteration,batch_size,
              print_every,save_every,clip,corpus_name,loadFilename):
  
  # Load batches for each iteration
  training_batches = [batch2TrainData(voc,[random.choice(pairs) for _ in range(batch_size)]) for _ in range(n_iteration)]
  
  # Initilizing 
  print("initializing")
  start_iteration =1 
  print_loss =0
  if loadFilename:
    start_iteration = checkpoint['iteration']+1
 
  # Training Loop
  print("training")
  for iteration in range(start_iteration,n_iteration +1):
    training_batch = training_batches[iteration -1]
    # Extract fields from batch
    input_variable , lengths, target_variable, mask , max_target_len = training_batch
    
    # Run a training iteration with batch
    loss =  train(input_variable,lengths,target_variable,mask,max_target_len,encoder,
                  decoder,embedding,encoder_optimizer,decoder_optimizer,batch_size,clip)
    print_loss +=loss
    
    # Print Progress
    if iteration % print_every == 0:
      print_loss_avg = print_loss/print_every
      print(f"Iteration {iteration}; Percent Complete {iteration/n_iteration *100} ; Avg Loss {print_loss_avg}")
      print_loss =0
    
    # save checkpoint
    if iteration %save_every == 0:
      print('skipping saving for now ')
#       directory = os.path.join(save_dir,model_name,corpus_name,f"{encoder_n_layers}_{decoder_n_layers}_{hidden_size}")
#       if not os.path.exists(directory):
#         os.mkdir(directory)
#       torch.save({
#           'iteration':iteration,
#           'en':encoder.state_dict(),
#           'de':decoder.state_dict(),
#           'en_opt':encoder_optimizer.state_dict(),
#           'de_opt':decoder_optimizer.state_dict(),
#           'loss':loss,
#           'voc_dict':voc.__dict__,
#           'embedding':embedding.state_dict()
#       },os.path.join(dirctory,f'{iteration}_{checkpint}.tar'))

** Define Evaluation **

After training need to decode the encoded input 

**Greedy decoding **

method used during training when not using teacher forcing . For each time step simply choose the word from decoder_output wih highest softmax value ( Optimal on single time-step-level)

GreedySearchDecoder - takes input_seq ( shape - (input_seq_length ,1) ) , a scaler input_length tensor and a max_length to bound the response sentence length

Computation Graph


1.   Forward input through encoder model
2.   Prepare encoder's final hidden layer to be first hidden input to the decoder
3.   Initialize decoder's first input as SOS_token
4.   Initialize tensors to append decoded words to
5. **Iteratively decode one word token at a time **
        

        1. Forward pass through decoder
        2. Obtain most likely word token and its softmax score
        3. Record token and score
        4. Prepare current token to be next decoder input

6.   Return collections of word tokens and scores.





In [0]:
class GreedySearchDecoder(nn.Module):
  def __init__(self,encoder,decoder):
    super(GreedySearchDecoder,self).__init__()
    self.encoder = encoder
    self.decoder = decoder
    
  def forward(self,input_seq,input_length,max_length):
    # Forward input through encoder model 
    encoder_outputs , encoder_hidden = self.encoder(input_seq,input_length)
    # Prepare encoder's final hidden layer to be the first hidden input to the decoder
    decoder_hidden = encoder_hidden[:decoder.n_layers]
    # Initialize decoer input with SOS_token
    decoder_input = torch.ones(1,1,device=device,dtype=torch.long) * SOS_token
    # initialize tensors to append decoded words to 
    all_tokens = torch.zeros([0],device=device,dtype=torch.long)
    all_scores = torch.zeros([0],device=device)
    
    # Iteratively decode one word token at a time 
    for _ in range(max_length):
      # forward pass through decoder
      decoder_output,decoder_hidden = self.decoder(decoder_input,decoder_hidden,encoder_outputs)
      # Obtain most likelly word token and its softmax score
      decoder_scores, decoder_input = torch.max(decoder_output,dim=1)
      # Record token and score
      all_tokens = torch.cat((all_tokens,decoder_input),dim=0)
      all_scores = torch.cat((all_scores,decoder_scores),dim=0)
      #Prepare current token to be next decoder input (add a dimension)
      decoder_input = torch.unsqueeze(decoder_input,0)
    # return collections of word tokens and scores
    return all_tokens,all_scores
  

** Evaluate My Text **


Functions for evaluating a string input sentence .

evaluate - manage the low level processing of handling the input sentence . Format the sentence as an input batch of word indexes with batch_size =1 by converting words of a sentence to their corresponding indexes and transposing the dimensions 

Also create a lengths tensor which contains the length of the input sentence , this case lengths a scalar as only evaluating one sentence at a time ( batch_size==1) 

Next obtain the decoded response sentence tensor using GreedySearchDecoder object . finally convert the response indexes to words and return decoded words

evaluateInput - UI for the chatbot , enter query sentence . After text is normalized in the same way as training data it is fed to evaluate function to obtain a decoded output. Looping this function till q or quit is entered

If sentence contains word not in vocaulary then print error message

In [0]:
def evaluate(encoder,decoder,searcher,voc,sentence,max_length=MAX_LENGTH):
  # Format input sentence as a batch 
  # words -> indexes
  indexes_batch = [indexesFromSentence(voc,sentence)]
  # Create lengths tensor
  lengths = torch.tensor([len(indexes) for indexes in indexes_batch])
  # Transpose dimensions of batch to match model's expectations 
  input_batch = torch.LongTensor(indexes_batch).transpose(0,1)
  # Use appropriate device
  input_batch= input_batch.to(device)
  lengths = lengths.to(device)
  # decode sentences with searcher
  tokens,scores = searcher(input_batch,lengths,max_length)
  # indexes -> words
  decoded_words = [voc.index2word[token.item()] for token in tokens]
  return decoded_words

def evaluateInput(encoder,decoder,searcher,voc):
  input_sentence = ''
  while(1):
    try:
      # Get input sentence
      input_sentence = input('> ')
      # check if quit case
      if input_sentence == 'q' or input_sentence =='quit':break
      #Normalize sentence
      input_sentence = normalizeString(input_sentence)
      # evaluate sentence
      output_words = evaluate(encoder,decoder,searcher,voc,input_sentence)
      # Format and print response
      output_words[:] = [x for x in output_words if not(x=='EOS' or x =='PAD')]
      print('BOT :',''.join(output_words))
    except Exception as ex:
      print("Error: encounterd unknown word")

** Run Model**

Setting configurations 

In [45]:
# configure models 
model_name = 'cb_model'
attn_model = 'dot'

hidden_size = 500
encoder_n_layers =3
decoder_n_layers =3
dropout=0.1
batch_size = 64

#checkpoint to oad from 
loadFilename =None
checkpoint_iter =4000

# Load model if model loadFileName is provided
if loadFilename:
  checkpoint = torch.load(loadFilename)
  encoder_sd = checkpoint['en']
  decoder_sd = checkpoint['de']
  encoder_optimizer_sd = checkpoint['en_opt']
  decoder_optimizer_sd = checkpoint['de_opt']
  embedding_sd = checkpoint['embedding']
  voc.__dict__ = checkpoint['voc_dict']
  
print('Building encoder and decoder')

# initailize word embedding
embedding = nn.Embedding(voc.num_words,hidden_size)
if loadFilename:
  embedding.load_state_dict(embedding_sd)
# Initialize encoder and decoder models
encoder = EncoderRNN(hidden_size,embedding,encoder_n_layers,dropout)
decoder = LuongAttnDecoderRNN(attn_model,embedding,hidden_size,voc.num_words,decoder_n_layers,dropout)

if loadFilename:
  encoder.load_state_dict(encoder_sd)
  decoder.load_state_dict(decoder_sd)

# setting device 
encoder = encoder.to(device)
decoder = decoder.to(device)

print('Models built ')

Building encoder and decoder
Models built 


** Training the model**

In [50]:
# configure training/optimization
clip = 50.0
teacher_forcing_ratio = 1.0
learning_rate = 0.0001
decoder_learning_ratio = 5.0
n_iteration = 4000
print_every = 10
save_every = 500

# ensure dropout layers are tain mode 
encoder.train()
decoder.train()

# Initiaize optimizers 
print('Building optimizers')
encoder_optimizer = optim.Adam(encoder.parameters(),lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(),lr=learning_rate)

if loadFilename:
  encoder_optimizer.load_state_dict(encoder_optimizer_sd)
  decoder_optimizer.load_state_dict(decoder_optimizer_sd)

print('Starting training')
trainIters(model_name,voc,pairs,encoder,decoder,encoder_optimizer,decoder_optimizer,
               embedding,encoder_n_layers,decoder_n_layers,save_dir,n_iteration,batch_size,
              print_every,save_every,clip,corpus_name,loadFilename)

Building optimizers
Starting training
initializing
training
Iteration 10; Percent Complete 0.25 ; Avg Loss 3.7026574423813905
Iteration 20; Percent Complete 0.5 ; Avg Loss 3.661538867176567
Iteration 30; Percent Complete 0.75 ; Avg Loss 3.606323970893451
Iteration 40; Percent Complete 1.0 ; Avg Loss 3.691636301809086
Iteration 50; Percent Complete 1.25 ; Avg Loss 3.6285671456545714
Iteration 60; Percent Complete 1.5 ; Avg Loss 3.5709345787061793
Iteration 70; Percent Complete 1.7500000000000002 ; Avg Loss 3.601548357626309
Iteration 80; Percent Complete 2.0 ; Avg Loss 3.6081869173212686
Iteration 90; Percent Complete 2.25 ; Avg Loss 3.5915704634440644
Iteration 100; Percent Complete 2.5 ; Avg Loss 3.605436584652491
Iteration 110; Percent Complete 2.75 ; Avg Loss 3.663495206419826
Iteration 120; Percent Complete 3.0 ; Avg Loss 3.5958588502503916
Iteration 130; Percent Complete 3.25 ; Avg Loss 3.574765725814595
Iteration 140; Percent Complete 3.5000000000000004 ; Avg Loss 3.6564385856011

Run Evaluation

In [0]:
# set dropout layers to eval
encoder.eval()
decoder.eval()

# Initialize search module
searcher = GreedySearchDecoder(encoder,decoder)



In [72]:
# Begin chatting
evaluateInput(encoder,decoder,searcher,voc)

> Hi model
BOT : hi.
> what the weather like
BOT : yourenotgoingtodoit.
> what am i 
BOT : yourenotgoingtodo.
> ok
BOT : imsorry.
> for what
BOT : thelittletime.
> what little time 
BOT : imsorry.
> it's ok
BOT : what?
> what is not ok
BOT : yourenotgoingtodoit.
> did you watch the movie
BOT : imnot.
> looks like you are stuck
BOT : imsorry.
> yup local mininma
Error: encounterd unknown word
> q


NoneType: None
