# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import sklearn
import random
from nltk.corpus import brown
from torchtext.datasets import SQuAD1
from nltk.tokenize import RegexpTokenizer
from torch import nn
from torch.utils.tensorboard import SummaryWriter

In [2]:
nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

[nltk_data] Downloading package brown to /home/AdmUser/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /home/AdmUser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
def loadDF(train_iter):
    '''

    You will use this function to load the dataset into a Pandas Dataframe for processing.

    '''
    df = {"question": [], "answer": []}
    index = 0
    for context, question, answers, indices in train_iter:
        if answers[0]:
            df["question"].append(question)
            df["answer"].append(answers[0])
        index += 1
    return pd.DataFrame.from_dict(df)

In [4]:
train_data, test_data = SQuAD1(root='.', split=('train','dev'))
train_data = loadDF(train_data)
test_data = loadDF(test_data)

In [5]:
class Vocab:
    def __init__(self, name,trimMinValue):
        self.name = name
        self.index = {0:"<sos>", 1:"<eos>", 2:"<pad>", 3:"<unk>"}
        self.words = {"<sos>":0, "<eos>":1, "<pad>":2, "<unk>":3}
        self.wordsCounter = {"<sos>":0, "<eos>":0, "<pad>":0, "<unk>":0}
        self.count = 4
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.trimMinValue = trimMinValue
    
    def indexWord(self, word):
        if word not in self.words:
            self.words[word] = self.count
            self.wordsCounter[word] = 1
            self.index[self.count] = word
            self.count += 1
        else:
            self.wordsCounter[word] += 1
    
    def addSentence(self, sentence):
        tokens = self.tokenizeSentence(sentence)
        for token in tokens:
            token = token.lower()
            self.indexWord(token)
    
    def tokenizeSentence(self, sentence): 
        return self.tokenizer.tokenize(sentence)
    
    def trimVocab(self):
        trimmedIndex = {0:"<sos>", 1:"<eos>", 2:"<pad>", 3:"<unk>"}
        trimmedWords = {"<sos>":0, "<eos>":1, "<pad>":2, "<unk>":3}
        trimmedWordsCounter = {"<sos>":0, "<eos>":0, "<pad>":0, "<unk>":0}
        trimmedCount = 4
        for i in range(4,self.count):
            if self.wordsCounter[self.index[i]] >= self.trimMinValue:
                trimmedWords[self.index[i]] = trimmedCount
                trimmedWordsCounter[self.index[i]] = self.wordsCounter[self.index[i]]
                trimmedIndex[trimmedCount] = self.index[i]
                trimmedCount += 1
        self.index = trimmedIndex
        self.words = trimmedWords
        self.wordsCounter = trimmedWordsCounter
        self.count = trimmedCount
    
        
    def prepareSentence(self, sentence):
        tokens = self.tokenizeSentence(sentence)
        sentence = []
        for token in tokens:
            token = token.lower()
            if token in self.words:
                sentence.append(token)
            else:
                sentence.append("<unk>")
        sentence.insert(0, "<sos>")
        sentence.append("<eos>")
        
        return sentence
    
    def paddSentence(self, sentence, maxlen):
        paddedSentence = []
        if len(sentence)>maxlen:
            for i in range(maxlen-1):
                paddedSentence.append(sentence[i])
            paddedSentence.append("<eos>")
        elif len(sentence)<=maxlen:
            paddedSentence = sentence
            for i in range(len(sentence),maxlen):
                paddedSentence.append("<pad>")
        return paddedSentence
    
    def paddSentences(self, sentences, maxlen):
        paddedSentences = []
        for sentence in sentences:
            paddedSentences.append(self.paddSentence(sentence, maxlen))
        return paddedSentences
        
    def indexSentence(self,sentence):
        return [self.words[w] for w in sentence]
    
    def wordSentence(self,sentence):
        return [self.index[w] for w in sentence]
    
            

In [6]:
vocab = Vocab(name='SQuAD1Vocab', trimMinValue = 10)

In [7]:
#Add train and test data to vocab

for i,r, in train_data.iterrows():
  question_text = vocab.addSentence(r["question"])
  answer_text = vocab.addSentence(r["answer"])
for i,r, in test_data.iterrows():
  question_text = vocab.addSentence(r["question"])
  answer_text = vocab.addSentence(r["answer"])
print("Added {} words to our vocabulary".format(vocab.count))

Added 55995 words to our vocabulary


In [8]:
#Trim vocab

vocab.trimVocab()
print("Remain {} words in our vocabulary after trimming".format(vocab.count))

Remain 9959 words in our vocabulary after trimming


In [9]:
#Prepare train and test sentences
trainQ = []
trainA = []
testQ = []
testA = []

for i,r, in train_data.iterrows():
    trainQ.append(vocab.prepareSentence(r["question"]))
    trainA.append(vocab.prepareSentence(r["answer"]))
for i,r, in test_data.iterrows():
    testQ.append(vocab.prepareSentence(r["question"]))
    testA.append(vocab.prepareSentence(r["answer"]))


In [10]:
print(trainQ[3])
#Padd sentences
maxlen = 10
trainQ = vocab.paddSentences(trainQ, maxlen)
trainA = vocab.paddSentences(trainA, maxlen)
testQ = vocab.paddSentences(testQ, maxlen)
testA = vocab.paddSentences(testA, maxlen)
print(trainQ[3])

['<sos>', 'what', 'is', 'the', '<unk>', 'at', 'notre', 'dame', '<eos>']
['<sos>', 'what', 'is', 'the', '<unk>', 'at', 'notre', 'dame', '<eos>', '<pad>']


In [11]:
#Tokenize sentences
trainQtok = []
trainAtok = []
testQtok = []
testAtok = []

for sentence in trainQ:
    trainQtok.append(vocab.indexSentence(sentence))
for sentence in trainA:
    trainAtok.append(vocab.indexSentence(sentence))
for sentence in testQ:
    testQtok.append(vocab.indexSentence(sentence))
for sentence in testA:
    testAtok.append(vocab.indexSentence(sentence))

print(trainQtok[0])

[0, 4, 5, 6, 7, 8, 9, 10, 11, 1]


In [12]:
def get_batches(questions, answers, batch_size):

    n_batches = len(questions)//batch_size
    
    # only full batches
    questions = questions[:n_batches*batch_size]
    answers = answers[:n_batches*batch_size]

    for idx in range(0, len(questions), batch_size):
        questions_batch, answers_batch = [], []
        questions_batch = np.array(questions[idx:idx+batch_size])
        answers_batch = np.array(answers[idx:idx+batch_size])
        yield questions_batch, answers_batch

In [13]:

class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, num_layers, p):
        
        super(Encoder, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        self.num_layers = num_layers

        self.dropout = nn.Dropout(p)
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, dropout = p, batch_first = True)

    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        embedding = self.dropout(self.embedding(i))

        o, (h, c) = self.lstm(embedding)
        
        return h, c


class Decoder(nn.Module):
      
    def __init__(self, hidden_size, embedding_size, output_size, num_layers, p):
        
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        self.output_size = output_size
        self.num_layers = num_layers

        self.dropout = nn.Dropout(p)
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(self.output_size, self.embedding_size)
        # self.lstm, accepts the embeddings and outputs a hidden state
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, self.num_layers, dropout = p, batch_first = True)
        # self.ouput, predicts on the hidden state via a linear output layer 
        self.fc = nn.Linear(self.hidden_size, self.output_size)    
        
    def forward(self, i, h, c):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        i = i.unsqueeze(-1)
        embedding = self.dropout(self.embedding(i))

        o, (h, c) = self.lstm(embedding, (h, c))

        pred = self.fc(o)

        pred = pred.squeeze(1)

        return pred, h, c
        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder, decoder, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):      
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.output_size

        prediction = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        h, c = self.encoder(src)

        #sos token -> trg[batch_size, seq_len]
        i = trg[:, 0]

        for t in range(1, trg_len):
            o, h, c = self.decoder(i, h, c)
            prediction[:, t] = o

            teacher_force = random.random() < teacher_forcing_ratio
            top1 = o.argmax(1) 
            i = trg[:, t] if teacher_force else top1
        
        return prediction

In [14]:
num_epochs = 50
batch_size = 256
learning_rate = 0.01

input_size = vocab.count
output_size = vocab.count
embedding_size = 256
hidden_size = 512
num_layers = 2
p_dropout = 0.3
teacher_forcing_ratio = 0.5

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

enc = Encoder(input_size, hidden_size, embedding_size, num_layers, p_dropout)
dec = Decoder(hidden_size, embedding_size, output_size, num_layers, p_dropout)

model = Seq2Seq(enc, dec, device).to(device)
print(model)

cuda
Seq2Seq(
  (encoder): Encoder(
    (dropout): Dropout(p=0.3, inplace=False)
    (embedding): Embedding(9959, 256)
    (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.3)
  )
  (decoder): Decoder(
    (dropout): Dropout(p=0.3, inplace=False)
    (embedding): Embedding(9959, 256)
    (lstm): LSTM(256, 512, num_layers=2, batch_first=True, dropout=0.3)
    (fc): Linear(in_features=512, out_features=9959, bias=True)
  )
)


In [15]:

optimizer = torch.optim.Adam(model.parameters(),lr=learning_rate)
criterion = nn.CrossEntropyLoss(ignore_index = vocab.words['<pad>'])
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, "min", verbose=True, threshold=0.01)

In [16]:
def train(model, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    l = 0
    for x, y in get_batches(trainQtok, trainAtok, batch_size):
        
        inputs = torch.from_numpy(x).to(device)
        targets = torch.from_numpy(y).to(device)

        optimizer.zero_grad()
        
        output = model(inputs, targets, teacher_forcing_ratio)

        #output [batch size, trg len, output dim]->[batch size * trg len, output dim]
        #targets [batch size, trg len, output dim]->[batch size * trg len]
        #ignore first token
        loss = criterion(output[1:].view(-1, output.shape[-1]), targets[1:].view(-1))
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
        l += 1
        
    return epoch_loss / l

In [17]:
def evaluate(model, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
        l = 0
        for x, y in get_batches(testQtok, testAtok, batch_size):

            inputs = torch.from_numpy(x).to(device)
            targets = torch.from_numpy(y).to(device)

            output = model(inputs, targets, teacher_forcing_ratio = 0)

            loss = criterion(output[1:].view(-1, output.shape[-1]), targets[1:].view(-1))
            
            epoch_loss += loss.item()

            l += 1
        
    return epoch_loss / l

In [18]:
#Tensorboard
writer = SummaryWriter(log_dir="./runs")

In [19]:
best_train_loss = float('inf')
clip = 1
save_path = 'chatbot_model.pt'
for epoch in range(num_epochs):
    train_loss = train(model, optimizer, criterion, clip)
    scheduler.step(train_loss)
    
    if train_loss < best_train_loss:
        best_train_loss = train_loss
        torch.save(model.state_dict(), save_path)

    print("Epoch: " + str(epoch) + " | TrainLoss: " + str(train_loss) + " | BestTrainLoss: " + str(best_train_loss))
    writer.add_scalar("Loss/train", train_loss, epoch)
    

writer.flush()
writer.close()

Epoch: 0 | TrainLoss: 7.503294228113186 | BestTrainLoss: 7.503294228113186
Epoch: 1 | TrainLoss: 7.317041139156498 | BestTrainLoss: 7.317041139156498
Epoch: 2 | TrainLoss: 6.7162404659895865 | BestTrainLoss: 6.7162404659895865
Epoch: 3 | TrainLoss: 6.709279183058711 | BestTrainLoss: 6.709279183058711
Epoch: 4 | TrainLoss: 6.715953029387179 | BestTrainLoss: 6.709279183058711
Epoch: 5 | TrainLoss: 6.623720691915144 | BestTrainLoss: 6.623720691915144
Epoch: 6 | TrainLoss: 6.517546980004561 | BestTrainLoss: 6.517546980004561
Epoch: 7 | TrainLoss: 6.4598141525223935 | BestTrainLoss: 6.4598141525223935
Epoch: 8 | TrainLoss: 6.456882379208392 | BestTrainLoss: 6.456882379208392
Epoch: 9 | TrainLoss: 6.528585609636809 | BestTrainLoss: 6.456882379208392
Epoch: 10 | TrainLoss: 6.54226643579048 | BestTrainLoss: 6.456882379208392
Epoch: 11 | TrainLoss: 6.487820318567823 | BestTrainLoss: 6.456882379208392
Epoch: 12 | TrainLoss: 6.4762889335030005 | BestTrainLoss: 6.456882379208392
Epoch: 13 | TrainL

In [20]:
model.load_state_dict(torch.load('chatbot_model.pt'))
test_loss = evaluate(model, criterion)
print("TestLoss: " + str(test_loss))

TestLoss: 6.404298619526188
