# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





# References

・https://torch.classcat.com/2018/05/15/pytorch-tutorial-intermediate-seq2seq-translation/ 

・https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

In [1]:
# Restart after run
!pip install torchtext==0.9.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torchtext==0.9.0
  Downloading torchtext-0.9.0-cp37-cp37m-manylinux1_x86_64.whl (7.1 MB)
[K     |████████████████████████████████| 7.1 MB 5.0 MB/s eta 0:00:01
Collecting torch==1.8.0
  Downloading torch-1.8.0-cp37-cp37m-manylinux1_x86_64.whl (735.5 MB)
[K     |████████████████████████████████| 735.5 MB 9.2 kB/s  eta 0:00:01
[31mERROR: torchvision 0.10.0 has requirement torch==1.9.0, but you'll have torch 1.8.0 which is incompatible.[0m
Installing collected packages: torch, torchtext
Successfully installed torch-1.8.0 torchtext-0.9.0


In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown
from torchtext.datasets import SQuAD1
import string
import torch.nn as nn
import random 
from sklearn.model_selection import KFold

stemmer = nltk.stem.snowball.SnowballStemmer('english')

nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Load SQuAD1.0 Data

In [2]:
def loadDF(path):
    
    dataset_train, dataset_dev = SQuAD1(root = path, split = ('train', 'dev'))

    df_train = pd.DataFrame.from_dict(dataset_train)
    df_dev = pd.DataFrame.from_dict(dataset_dev)
    
    df = df_train.append(df_dev)
    
    return df

In [3]:
df = loadDF('.data')

feature = ["Sentence", "Question", "Answer", "?"]
df.columns = feature

df.head()

Unnamed: 0,Sentence,Question,Answer,?
0,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,[Saint Bernadette Soubirous],[515]
1,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,[a copper statue of Christ],[188]
2,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,[the Main Building],[279]
3,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,[a Marian place of prayer and reflection],[381]
4,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,[a golden statue of the Virgin Mary],[92]


In [4]:
df = df[["Question", "Answer"]]

df.head()

Unnamed: 0,Question,Answer
0,To whom did the Virgin Mary allegedly appear i...,[Saint Bernadette Soubirous]
1,What is in front of the Notre Dame Main Building?,[a copper statue of Christ]
2,The Basilica of the Sacred heart at Notre Dame...,[the Main Building]
3,What is the Grotto at Notre Dame?,[a Marian place of prayer and reflection]
4,What sits on top of the Main Building at Notre...,[a golden statue of the Virgin Mary]


In [5]:
# The amount of data was too large and the results were not available after one day.
# We trained with a smaller amount of data.
df = df.iloc[:5000, :]

# Clean Data

In [6]:
def prepare_text(sentence):

    sentence = ''.join([s.lower() for s in sentence if s not in string.punctuation])
    sentence = ' '.join(stemmer.stem(w) for w in sentence.split())
    tokens = nltk.tokenize.RegexpTokenizer(r'\w+').tokenize(sentence)

    return tokens

In [7]:
df['Question'] = df['Question'].apply(prepare_text)
df['Answer'] = df['Answer'].apply(prepare_text)
df.head()

Unnamed: 0,Question,Answer
0,"[to, whom, did, the, virgin, mari, alleg, appe...","[saint, bernadett, soubir]"
1,"[what, is, in, front, of, the, notr, dame, mai...","[a, copper, statu, of, christ]"
2,"[the, basilica, of, the, sacr, heart, at, notr...","[the, main, build]"
3,"[what, is, the, grotto, at, notr, dame]","[a, marian, place, of, prayer, and, reflect]"
4,"[what, sit, on, top, of, the, main, build, at,...","[a, golden, statu, of, the, virgin, mari]"


In [8]:
def getPairs(df):

    temp1 = df['Question'].apply(lambda x: " ".join(x) ).to_list()
    temp2 = df['Answer'].apply(lambda x: " ".join(x) ).to_list()
    
    return [list(i) for i in zip(temp1, temp2)]

In [9]:
pairs = getPairs(df)

In [10]:
def getMaxLen(pairs):
    
    max_src = 0 
    max_trg = 0
    
    for p in pairs:
        max_src = len(p[0].split()) if len(p[0].split()) > max_src else max_src
        max_trg = len(p[1].split()) if len(p[1].split()) > max_trg else max_trg
        
    return max_src, max_trg

In [11]:
max_src, max_trg = getMaxLen(pairs)
max_trg, max_src

(43, 29)

In [12]:
SOS_token = 0
EOS_token = 1

class Chatbot:
    def __init__(self):
        self.word2index = {"": SOS_token, "": EOS_token}
        self.index2word = {SOS_token: "", EOS_token: ""}
        self.words_count = len(self.word2index)

    def add_words(self, sentence):
        for word in sentence.split(" "):
            if word not in self.word2index:
                self.word2index[word] = self.words_count
                self.index2word[self.words_count] = word
                self.words_count += 1

In [13]:
SRC = Chatbot()
TRG = Chatbot()

for pair in pairs:
    SRC.add_words(pair[0])
    TRG.add_words(pair[1])

In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def toTensor(chatbot, sentence):

    indices = [chatbot.word2index[word] for word in sentence.split(' ')]
    indices.append(chatbot.word2index[''])
    
    return torch.Tensor(indices).long().to(device).view(-1, 1)

In [15]:
source_data = [toTensor(SRC, pair[0]) for pair in pairs]
target_data = [toTensor(TRG, pair[1]) for pair in pairs]

# Create Seq2Seq model

In [16]:
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.embedding = nn.Embedding(self.input_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)

    def forward(self, x, hidden, cell_state):
        
        x = self.embedding(x).view(1, 1, -1)
        x, (hidden, cell_state) = self.lstm(x, (hidden, cell_state))
        
        return x, hidden, cell_state
        

class Decoder(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.embedding = nn.Embedding(output_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)
        self.fc = nn.Linear(self.hidden_size, self.output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, hidden, cell_state):
        
        x = self.embedding(x).view(1, 1, -1)
        x, (hidden, cell_state) = self.lstm(x, (hidden, cell_state))
        x = self.softmax(self.fc(x[0]))
        
        return x, hidden, cell_state
    
     
class Seq2Seq(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Seq2Seq, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.encoder = Encoder(self.input_size, self.hidden_size)
        self.decoder = Decoder(self.hidden_size, self.output_size)
        
    def forward(self, src, trg, src_len, trg_len, teacher_force=1):
        
        output = {
            'decoder_output':[]
        }
        
        encoder_hidden = torch.zeros([1, 1, self.hidden_size]).to(device) 
        cell_state = torch.zeros([1, 1, self.hidden_size]).to(device)  
        
        for i in range(src_len):
            encoder_output, encoder_hidden, cell_state = self.encoder(src[i], encoder_hidden, cell_state)

        decoder_input = torch.Tensor([[0]]).long().to(device) 
        decoder_hidden = encoder_hidden
        
        for i in range(trg_len):
            decoder_output, decoder_hidden, cell_state = self.decoder(decoder_input, decoder_hidden, cell_state)
            output['decoder_output'].append(decoder_output)
            
            if self.training: 
                decoder_input = target_tensor[i] if random.random() > teacher_force else decoder_output.argmax(1) 
            else:
                _, top_index = decoder_output.data.topk(1)
                decoder_input = top_index.squeeze().detach()
                
        return output

# Train

In [17]:
def train(source_data, target_data, model, epochs, batch_size, print_every, learning_rate):
    
    model.to(device)
    total_training_loss = 0
    total_valid_loss = 0
    loss = 0
    
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    criterion = nn.NLLLoss()

    kf = KFold(n_splits=epochs, shuffle=True)

    for e, (train_index, test_index) in enumerate(kf.split(source_data), 1):
        model.train()
        for i in range(0, len(train_index)):

            src = source_data[i]
            trg = target_data[i]

            output = model(src, trg, src.size(0), trg.size(0))

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            loss += current_loss
            total_training_loss += (current_loss.item() / trg.size(0))

            if i % batch_size == 0 or i == (len(train_index)-1):
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                loss = 0

        model.eval()
        for i in range(0, len(test_index)):
            src = source_data[i]
            trg = target_data[i]

            output = model(src, trg, src.size(0), trg.size(0))

            current_loss = 0
            for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

            total_valid_loss += (current_loss.item() / trg.size(0))

        if e % print_every == 0:
            training_loss_average = total_training_loss / (len(train_index)*print_every)
            validation_loss_average = total_valid_loss / (len(test_index)*print_every)
            print("{}/{} Epoch  -  Training Loss = {:.4f}  -  Validation Loss = {:.4f}".format(e, epochs, training_loss_average, validation_loss_average))
            total_training_loss = 0
            total_valid_loss = 0 

In [18]:
learning_rate = 0.01
hidden_size = 128
batch_size = 128
epochs = 50

seq2seq = Seq2Seq(SRC.words_count, hidden_size, TRG.words_count)

train(source_data = source_data,
      target_data = target_data,
      model = seq2seq,
      epochs = epochs,
      batch_size = batch_size,
      print_every = 5,
      learning_rate = learning_rate)

5/50 Epoch  -  Training Loss = 5.8499  -  Validation Loss = 5.8281
10/50 Epoch  -  Training Loss = 5.3663  -  Validation Loss = 5.4062
15/50 Epoch  -  Training Loss = 5.0204  -  Validation Loss = 5.0839
20/50 Epoch  -  Training Loss = 4.5684  -  Validation Loss = 4.5774
25/50 Epoch  -  Training Loss = 4.0227  -  Validation Loss = 4.0918
30/50 Epoch  -  Training Loss = 3.3892  -  Validation Loss = 3.5543
35/50 Epoch  -  Training Loss = 2.6756  -  Validation Loss = 2.9853
40/50 Epoch  -  Training Loss = 2.0059  -  Validation Loss = 2.2922
45/50 Epoch  -  Training Loss = 1.3721  -  Validation Loss = 1.6603
50/50 Epoch  -  Training Loss = 0.8429  -  Validation Loss = 1.0303


# Evaluate

In [19]:
def evaluate(src, SRC, TRG, model, target_max_len):
    
    try:
        src = toTensor(SRC, " ".join(prepare_text(src)))
    except:
        print("Error: I don't know!.")
        return
    
    answer_words = []
    
    output = model(src, None, src.size(0), target_max_len)

    for tensor in output['decoder_output']:

        _, top_token = tensor.data.topk(1)
        if top_token.item() == 1:
            break
        else:
            word = TRG.index2word[top_token.item()]
            answer_words.append(word)
            
    print("<", ' '.join(answer_words), "\n")

In [20]:
torch.save(seq2seq, 'seq2seq.pt')

seq2seq = torch.load('seq2seq.pt', map_location=torch.device('cuda'))
seq2seq.eval()

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4504, 128)
    (lstm): LSTM(128, 128)
  )
  (decoder): Decoder(
    (embedding): Embedding(4062, 128)
    (lstm): LSTM(128, 128)
    (fc): Linear(in_features=128, out_features=4062, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
)

# ex, Question & Answer List

In [21]:
df = loadDF('.data')

feature = ["Sentence", "Question", "Answer", "?"]
df.columns = feature

df = df[["Question", "Answer"]]

for i in range(0, 10): 
    print("> ", df.iloc[i,0], "\n< ", df.iloc[i,1], "\n") 

>  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 
<  ['Saint Bernadette Soubirous'] 

>  What is in front of the Notre Dame Main Building? 
<  ['a copper statue of Christ'] 

>  The Basilica of the Sacred heart at Notre Dame is beside to which structure? 
<  ['the Main Building'] 

>  What is the Grotto at Notre Dame? 
<  ['a Marian place of prayer and reflection'] 

>  What sits on top of the Main Building at Notre Dame? 
<  ['a golden statue of the Virgin Mary'] 

>  When did the Scholastic Magazine of Notre dame begin publishing? 
<  ['September 1876'] 

>  How often is Notre Dame's the Juggler published? 
<  ['twice'] 

>  What is the daily student paper at Notre Dame called? 
<  ['The Observer'] 

>  How many student news papers are found at Notre Dame? 
<  ['three'] 

>  In what year did the student paper Common Sense begin publication at Notre Dame? 
<  ['1987'] 



In [22]:
print("Type 'exit' to finish the chat.\n", "-"*50, '\n')
while (True):
    src = input("> ")
    if src.strip() == "exit":
        break
    evaluate(src, SRC, TRG, seq2seq, max_trg)

Type 'exit' to finish the chat.
 -------------------------------------------------- 

> To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
<  

> What is in front of the Notre Dame Main Building? 
< a copper of of a 

> The Basilica of the Sacred heart at Notre Dame is beside to which structure?
< the main build 

> What is the Grotto at Notre Dame?
< a marian place place in in 

> What sits on top of the Main Building at Notre Dame?
< a golden statu of of of 

> exit
