# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.
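
One way to use pretrained vectors is to copy them into the embedding layer's weight matrix before training. The sketch below is only an illustration: the `pretrained` dict stands in for the Gensim vectors, and `word2index` and `build_embedding` are hypothetical names that do not come from the project code. Words without a pretrained vector keep a small random initialisation.

```python
import torch
import torch.nn as nn


def build_embedding(word2index, pretrained, dim, freeze=False):
    """Create an nn.Embedding whose rows are copied from `pretrained` where available."""
    # Start from small random vectors so uncovered words still get a row.
    weights = torch.randn(len(word2index), dim) * 0.1
    hits = 0
    for word, idx in word2index.items():
        vec = pretrained.get(word)
        if vec is not None:
            weights[idx] = torch.tensor(vec, dtype=torch.float)
            hits += 1
    # freeze=False lets training fine-tune the pretrained rows.
    embedding = nn.Embedding.from_pretrained(weights, freeze=freeze)
    return embedding, hits


# Toy stand-ins (assumptions, not real Brown embeddings).
word2index = {"<sos>": 0, "<eos>": 1, "grotto": 2, "statue": 3}
pretrained = {"grotto": [0.1, 0.2, 0.3], "statue": [0.4, 0.5, 0.6]}

embedding, hits = build_embedding(word2index, pretrained, dim=3)
print(f"{hits} of {len(word2index)} words initialised from pretrained vectors")
```

Training one model with such an embedding and one with a randomly initialised `nn.Embedding`, then comparing validation loss, is one simple way to run the comparison suggested above.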



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder and passing the Encoder's final hidden state to the Decoder, which uses it to output a series of token predictions.
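
The two components can be sketched in PyTorch as follows. This is a minimal illustration and not the code in `src.Models`: the class names, the `greedy_decode` helper, and the tensor shapes are assumptions, while the layer names follow the structure the notebook prints for the saved model.

```python
import torch
import torch.nn as nn


class EncoderSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)

    def forward(self, src):
        # src: (src_len, batch) tensor of token indices
        embedded = self.embedding(src)           # (src_len, batch, hidden)
        _, (hidden, cell) = self.lstm(embedded)  # keep only the final states
        return hidden, cell


class DecoderSketch(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size)
        self.fc = nn.Linear(hidden_size, vocab_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, token, hidden, cell):
        # token: (batch,) tensor holding the previously emitted token
        embedded = self.embedding(token).unsqueeze(0)          # (1, batch, hidden)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        log_probs = self.softmax(self.fc(output.squeeze(0)))   # (batch, vocab)
        return log_probs, hidden, cell


def greedy_decode(encoder, decoder, src, sos_index, max_len):
    """Encode `src`, then emit the most likely token at each decoder step."""
    outputs = []
    with torch.no_grad():
        hidden, cell = encoder(src)
        token = torch.full((src.size(1),), sos_index, dtype=torch.long)
        for _ in range(max_len):
            log_probs, hidden, cell = decoder(token, hidden, cell)
            token = log_probs.argmax(dim=1)
            outputs.append(token)
    return torch.stack(outputs)  # (max_len, batch)


encoder = EncoderSketch(vocab_size=20, hidden_size=8)
decoder = DecoderSketch(vocab_size=15, hidden_size=8)
src = torch.randint(0, 20, (5, 2))  # 5 tokens, batch of 2
predicted = greedy_decode(encoder, decoder, src, sos_index=0, max_len=4)
print(predicted.shape)
```

For brevity the sketch always decodes `max_len` steps; a chatbot would normally stop once an end-of-sequence token is produced.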

## Dependencies

- PyTorch
- NumPy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. The available options are listed here:

- https://pytorch.org/text/stable/datasets.html





In [1]:
# Please restart the kernel after running this cell
!pip install torch==1.12.0 torchdata==0.4.0 torchtext==0.13.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torch==1.12.0
  Downloading torch-1.12.0-cp37-cp37m-manylinux1_x86_64.whl (776.3 MB)
Collecting torchdata==0.4.0
  Downloading torchdata-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl (4.4 MB)
Collecting torchtext==0.13.0
  Downloading torchtext-0.13.0-cp37-cp37m-manylinux1_x86_64.whl (1.9 MB)
Collecting portalocker>=2.0.0
  Downloading portalocker-2.6.0-py2.py3-none-any.whl (15 kB)
ERROR: torchvision 0.10.0 has requirement torch==1.9.0, but you'll have torch 1.12.0 which is incompatible.
Installing collected packages: torch, portalocker, torchdata, torchtext
Successfully installed portalocker-2.6.0 torch-1.12.0 t

In [1]:
from src.Data import loadDF, prepare_text, getPairs, toTensor, getMaxLen
from src.Models import Seq2Seq
from src.Vocab import Vocab
from src.Train import train
from src.Evaluate import evaluate
import random

In [2]:
learning_rate = 0.01
hidden_size = 128 # encoder and decoder hidden size
batch_size = 128
epochs = 65

In [3]:
data_df = loadDF('data')
# Keep only the first 5,000 Q&A pairs to avoid CUDA out-of-memory errors on the full dataset.
# .copy() avoids a pandas SettingWithCopyWarning when the columns are reassigned later.
data_df = data_df.iloc[:5000, :].copy()

In [4]:
for i in range(0, 5): # first 5 Q&A
    print("> ", data_df.iloc[i,0], "\n< ", data_df.iloc[i,1], "\n") 

>  To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France? 
<  Saint Bernadette Soubirous 

>  What is in front of the Notre Dame Main Building? 
<  a copper statue of Christ 

>  The Basilica of the Sacred heart at Notre Dame is beside to which structure? 
<  the Main Building 

>  What is the Grotto at Notre Dame? 
<  a Marian place of prayer and reflection 

>  What sits on top of the Main Building at Notre Dame? 
<  a golden statue of the Virgin Mary 



In [5]:
data_df['Question'] = data_df['Question'].apply(prepare_text)
data_df['Answer'] = data_df['Answer'].apply(prepare_text)
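
`prepare_text` lives in `src.Data` and is not shown in this notebook. As a rough sketch of what such a normalisation step might do, the hypothetical `normalize_text` below lowercases the text, drops punctuation, and collapses whitespace. The stemmed words in the chat output (for example "statu" and "reflect") suggest the real function also applies a stemmer, which this sketch leaves out.

```python
import re


def normalize_text(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # Keep ASCII letters, digits, and whitespace only (accented letters are dropped too).
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("What is the Grotto at Notre Dame?"))
# what is the grotto at notre dame
```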

In [6]:
pairs = getPairs(data_df)

In [7]:
max_src, max_trg = getMaxLen(pairs)
max_trg, max_src  # displayed in (max_trg, max_src) order

(43, 29)
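
For reference, the longest question and answer lengths can be computed along these lines. The sketch assumes each pair is a `(question, answer)` tuple of whitespace-separated strings; `max_lengths` is a hypothetical helper, not the project's `getMaxLen`.

```python
def max_lengths(pairs):
    """Return (longest source length, longest target length) in whitespace tokens."""
    max_src = max(len(src.split()) for src, _ in pairs)
    max_trg = max(len(trg.split()) for _, trg in pairs)
    return max_src, max_trg


toy_pairs = [
    ("what is the grotto at notre dame", "a marian place of prayer and reflection"),
    ("what is in front of the notre dame main building", "a copper statue of christ"),
]
max_src_len, max_trg_len = max_lengths(toy_pairs)
print(max_src_len, max_trg_len)  # 10 7
```

The target maximum matters at inference time, because it bounds how many tokens the decoder is allowed to generate.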

In [8]:
Q_vocab = Vocab()
A_vocab = Vocab()

# build vocabularies for questions "source" and answers "target"
for pair in pairs:
    Q_vocab.add_words(pair[0])
    A_vocab.add_words(pair[1])
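
The `Vocab` class comes from `src.Vocab` and its implementation is not shown. A minimal word-to-index vocabulary might look like the sketch below; the class name `SimpleVocab`, the reserved `<sos>`/`<eos>` tokens, and the `encode` method are assumptions made for illustration.

```python
class SimpleVocab:
    """Hypothetical stand-in for a vocabulary: maps words to integer ids."""

    def __init__(self):
        # Reserve ids for start- and end-of-sequence markers (an assumption).
        self.word2index = {"<sos>": 0, "<eos>": 1}
        self.index2word = {0: "<sos>", 1: "<eos>"}
        self.words_count = 2

    def add_words(self, sentence):
        """Register every unseen word of a whitespace-tokenised sentence."""
        for word in sentence.split():
            if word not in self.word2index:
                self.word2index[word] = self.words_count
                self.index2word[self.words_count] = word
                self.words_count += 1

    def encode(self, sentence):
        """Return the sentence as a list of ids, terminated by <eos>."""
        ids = [self.word2index[word] for word in sentence.split()]
        return ids + [self.word2index["<eos>"]]


vocab = SimpleVocab()
vocab.add_words("a copper statue of christ")
vocab.add_words("a golden statue of the virgin mary")
ids = vocab.encode("a golden statue")
print(vocab.words_count, ids)  # 11 [2, 7, 4, 1]
```

Wrapping the id list in `torch.tensor(ids, dtype=torch.long)` would give the kind of index tensor an embedding layer expects.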

In [9]:
source_data = [toTensor(Q_vocab, pair[0]) for pair in pairs]
target_data = [toTensor(A_vocab, pair[1]) for pair in pairs]

In [10]:
seq2seq = Seq2Seq(Q_vocab.words_count, hidden_size, A_vocab.words_count)

train(source_data = source_data,
      target_data = target_data,
      model = seq2seq,
      print_every = 5,
      epochs = epochs,
      learning_rate = learning_rate,
      batch_size = batch_size)


5/65 Epoch  -  Training Loss = 5.7164  -  Validation Loss = 5.6037
10/65 Epoch  -  Training Loss = 5.2685  -  Validation Loss = 5.4477
15/65 Epoch  -  Training Loss = 5.0755  -  Validation Loss = 5.2686
20/65 Epoch  -  Training Loss = 4.8186  -  Validation Loss = 5.0039
25/65 Epoch  -  Training Loss = 4.4987  -  Validation Loss = 4.7094
30/65 Epoch  -  Training Loss = 4.1168  -  Validation Loss = 4.3646
35/65 Epoch  -  Training Loss = 3.6321  -  Validation Loss = 3.9487
40/65 Epoch  -  Training Loss = 3.1023  -  Validation Loss = 3.5066
45/65 Epoch  -  Training Loss = 2.5964  -  Validation Loss = 3.0633
50/65 Epoch  -  Training Loss = 1.9913  -  Validation Loss = 2.4387
55/65 Epoch  -  Training Loss = 1.4860  -  Validation Loss = 1.8890
60/65 Epoch  -  Training Loss = 1.0196  -  Validation Loss = 1.3978
65/65 Epoch  -  Training Loss = 0.6504  -  Validation Loss = 0.8777


In [11]:
import torch

model_path = 'seq2seq.pt'

torch.save(seq2seq, model_path)

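# Fall back to the CPU when no GPU is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
seq2seq = torch.load(model_path, map_location=device)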
seq2seq.eval()

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(4504, 128)
    (lstm): LSTM(128, 128)
  )
  (decoder): Decoder(
    (embedding): Embedding(4079, 128)
    (lstm): LSTM(128, 128)
    (fc): Linear(in_features=128, out_features=4079, bias=True)
    (softmax): LogSoftmax(dim=1)
  )
)
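
The cell above pickles the whole model object. A common alternative is to save only the parameters via `state_dict()` and rebuild the architecture before loading. The sketch below shows that pattern on a toy `nn.LSTM` standing in for the Seq2Seq model; the file name and temporary directory are illustrative only.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Use the GPU when present, otherwise the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.LSTM(input_size=4, hidden_size=4)  # toy stand-in for the trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pt")
    torch.save(model.state_dict(), path)  # save the parameters only

    # Rebuild the same architecture, then load the saved parameters into it.
    restored = nn.LSTM(input_size=4, hidden_size=4)
    restored.load_state_dict(torch.load(path, map_location=device))

restored.to(device).eval()  # switch to inference mode
```

Saving the `state_dict` keeps the checkpoint independent of the exact class definition and module path, at the cost of having to construct the model with the same sizes before loading.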

In [18]:
print("Type 'exit' to finish the chat.\n", "-"*30, '\n')
while True:
    src = input("> ")
    if src.strip() == "exit":
        break
    evaluate(src, Q_vocab, A_vocab, seq2seq, max_trg)

Type 'exit' to finish the chat.
 ------------------------------ 

> hi
Error: Word Encountered Not In The Vocabulary.
> Which prize did Frederick Buechner create? 
< buechner prize for preach 

> What is the Grotto at Notre Dame?
< a marian place of prayer and reflect 

> What is in front of the Notre Dame Main Building? 
< a copper statu of christ 

> exit
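
The `hi` input in the chat above fails because the word is not in the question vocabulary. One way to make that failure more informative is to report which words are unknown before calling the model. The helper below is a sketch with hypothetical names; `known_words` stands in for whatever word set the question vocabulary exposes.

```python
def unknown_words(sentence, known_words):
    """Return the words in `sentence` that are missing from `known_words`."""
    return [word for word in sentence.lower().split() if word not in known_words]


known = {"what", "is", "the", "grotto", "at", "notre", "dame"}

print(unknown_words("hi", known))                  # ['hi']
print(unknown_words("What is the Grotto", known))  # []
```

The input should pass through the same text preparation used in training before this check, so that casing and punctuation do not produce false misses.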
