# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pretrained embeddings against a model without them.
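
One way to set up that comparison is to initialise the embedding matrix from pretrained vectors and fall back to small random vectors for words the pretrained set lacks. The sketch below uses plain Python lists and a made-up `pretrained` dictionary (hypothetical toy vectors, not real Brown embeddings); `build_embedding_matrix` is an illustrative helper, not part of the starter code.

```python
import random


def build_embedding_matrix(word2index, pretrained, dim, seed=0):
    """Return one vector per vocabulary index, reusing pretrained vectors when available."""
    rng = random.Random(seed)
    matrix = [None] * len(word2index)  # assumes indices run from 0 to len - 1
    hits = 0
    for word, idx in word2index.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])  # copy the pretrained vector
            hits += 1
        else:
            # word missing from the pretrained set: small random initialisation
            matrix[idx] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return matrix, hits


# toy vocabulary and toy "pretrained" vectors (both hypothetical)
word2index = {"<SOS>": 0, "<EOS>": 1, "statu": 2, "christ": 3}
pretrained = {"christ": [0.5, -0.2, 0.1]}

matrix, hits = build_embedding_matrix(word2index, pretrained, dim=3)
print(f"{hits} of {len(word2index)} words found in the pretrained set")
```

Keep in mind that stemmed tokens such as `statu` may not appear in a pretrained vocabulary built from unstemmed text, so the hit count is worth checking before comparing models.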



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which passes its final hidden state to the Decoder; the Decoder then uses that state to output a series of token predictions.
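
The data flow described above can be sketched without any neural network. In this toy, `encode` and `decode_step` are hypothetical stand-ins for the LSTM Encoder and Decoder, and `<EOS>` is an assumed end-of-sequence marker. Only the control flow mirrors a Seq2Seq model: fold the input into a state, hand the state over, then emit tokens until the end marker or a length limit is reached.

```python
# Toy illustration of Seq2Seq control flow; nothing is learned here.
EOS = "<EOS>"  # assumed end-of-sequence marker


def encode(tokens):
    # Stand-in for the Encoder: fold the input tokens into a single "state".
    state = []
    for tok in tokens:
        state.append(tok)  # a real encoder would update hidden/cell tensors here
    return state


def decode_step(state):
    # Stand-in for one Decoder step: emit a token and return the updated state.
    if not state:
        return EOS, state
    return state[-1], state[:-1]  # toy rule: replay the input in reverse


def generate(tokens, max_len=10):
    state = encode(tokens)  # 1) encode the whole input
    output = []
    for _ in range(max_len):  # 2) decode one token at a time
        token, state = decode_step(state)
        if token == EOS:  # 3) stop at the end-of-sequence marker
            break
        output.append(token)
    return output


print(generate(["what", "is", "this"]))
```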

## Dependencies

- PyTorch
- NumPy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





# Install required packages
**Restart the kernel after running these cells**

In [1]:
# !pip install torch==1.12.0 torchdata==0.4.0 torchtext==0.13.0

In [2]:
# !pip install nb_black

# Load the data

In [1]:
%load_ext nb_black


In [2]:
import torchtext
import pandas as pd
from torchtext.datasets import SQuAD1


def loadDF(path):
    # Load the train and validation data from the SQuAD1 dataset
    train_data, val_data = SQuAD1(path)

    # Create a dictionary to store the questions and answers
    data_dict = {"Question": [], "Answer": []}

    # Extract the questions and answers from the train and validation data
    for data in (train_data, val_data):
        for _, question, answers, _ in data:
            data_dict["Question"].append(question)
            data_dict["Answer"].append(answers[0])

    # Convert the data dictionary to a pandas dataframe
    df = pd.DataFrame(data_dict)

    return df


In [3]:
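# keep only the first 6,000 pairs: the full SQuAD1 dataset is large and CUDA runs out of memory
data_df_raw = loadDF("data").iloc[:6000, :]  # untouched copy, used later for manual testing
data_df = data_df_raw.copy()  # working copy that will be tokenized
data_df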

|  | Question | Answer |
|---|---|---|
| 0 | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous |
| 1 | What is in front of the Notre Dame Main Building? | a copper statue of Christ |
| 2 | The Basilica of the Sacred heart at Notre Dame... | the Main Building |
| 3 | What is the Grotto at Notre Dame? | a Marian place of prayer and reflection |
| 4 | What sits on top of the Main Building at Notre... | a golden statue of the Virgin Mary |
| ... | ... | ... |
| 5995 | How many publications voted The College Dropou... | 2 |
| 5996 | What was the name of the single off the debut ... | Jesus Walks |
| 5997 | What label did Kanye create following the succ... | GOOD Music |
| 5998 | When was The College Dropout finally released? | February 2004 |



In [4]:
# print the first 5 question-answer pairs 
for idx, row in data_df.head(5).iterrows():
    question = "".join(row["Question"])
    answer = "".join(row["Answer"])
    print(f"> {question}\n< {answer}\n")

> To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
< Saint Bernadette Soubirous

> What is in front of the Notre Dame Main Building?
< a copper statue of Christ

> The Basilica of the Sacred heart at Notre Dame is beside to which structure?
< the Main Building

> What is the Grotto at Notre Dame?
< a Marian place of prayer and reflection

> What sits on top of the Main Building at Notre Dame?
< a golden statue of the Virgin Mary




# Prepare the data (clean, tokenize etc.)

In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import string

nltk.download("punkt")
nltk.download("stopwords")

snowball_stemmer = nltk.stem.snowball.SnowballStemmer("english")


def prepare_text(sentence):
    # remove punctuation and convert to lowercase
    sentence = "".join([s.lower() for s in sentence if s not in string.punctuation])
    # reduce each word to its stem (e.g. "building" -> "build")
    sentence = " ".join(snowball_stemmer.stem(w) for w in sentence.split())

    # tokenize the sentence
    tokens = nltk.tokenize.RegexpTokenizer(r"\w+").tokenize(sentence)

    return tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



In [6]:
# Tokenize the questions and answers in the DataFrame using the `prepare_text` function
data_df["Question"] = data_df["Question"].apply(prepare_text)
data_df["Answer"] = data_df["Answer"].apply(prepare_text)

# Print the first 5 pairs of tokenized questions and answers
for i in range(5):
    q_tokens = " ".join(
        data_df.loc[i, "Question"]
    )  # Join the question tokens into a single string
    a_tokens = " ".join(
        data_df.loc[i, "Answer"]
    )  # Join the answer tokens into a single string
    print(f"Q: {q_tokens}\nA: {a_tokens}\n")

Q: to whom did the virgin mari alleg appear in 1858 in lourd franc
A: saint bernadett soubir

Q: what is in front of the notr dame main build
A: a copper statu of christ

Q: the basilica of the sacr heart at notr dame is besid to which structur
A: the main build

Q: what is the grotto at notr dame
A: a marian place of prayer and reflect

Q: what sit on top of the main build at notr dame
A: a golden statu of the virgin mari




# Split the data into training and test sets and keep the sentence pairs for later use
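
Before sentences can be fed to the network, each word is mapped to an integer index and an end-of-sequence index is appended. The following is a minimal, framework-free sketch of that idea; `ToyVocab` and the `<SOS>`/`<EOS>` marker names are illustrative assumptions rather than the notebook's own classes.

```python
SOS, EOS = "<SOS>", "<EOS>"  # assumed names for the start/end markers


class ToyVocab:
    def __init__(self):
        # reserve index 0 for start-of-sequence and index 1 for end-of-sequence
        self.word2index = {SOS: 0, EOS: 1}
        self.index2word = {0: SOS, 1: EOS}

    def add_sentence(self, sentence):
        # give every unseen word the next free index
        for word in sentence.split():
            if word not in self.word2index:
                idx = len(self.word2index)
                self.word2index[word] = idx
                self.index2word[idx] = word

    def encode(self, sentence):
        # words -> indices, terminated by the end-of-sequence index
        return [self.word2index[w] for w in sentence.split()] + [self.word2index[EOS]]

    def decode(self, indices):
        # indices -> words, stopping at the end-of-sequence index
        words = []
        for i in indices:
            if i == self.word2index[EOS]:
                break
            words.append(self.index2word[i])
        return " ".join(words)


vocab = ToyVocab()
vocab.add_sentence("the main build")
ids = vocab.encode("the main build")
print(ids)
print(vocab.decode(ids))
```

A word that was never added raises a `KeyError` in `encode`, which is the situation a chatbot has to handle when the user types an unknown word.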

In [7]:
# special tokens: index 0 marks the start of a sequence, index 1 marks the end
src_token = 0  # index of "<SOS>" (assumed name for the start marker)
trg_token = 1  # index of "<EOS>" (assumed name for the end marker)


class build_vocab:
    def __init__(self):
        self.word2index = {"<SOS>": src_token, "<EOS>": trg_token}
        self.index2word = {src_token: "<SOS>", trg_token: "<EOS>"}
        self.word_count = len(self.word2index)

    def add_words(self, sentc):
        # add new words to the vocabulary
        split_sentence = sentc.split(" ")
        for word in split_sentence:
            if word not in self.word2index:
                self.word2index[word] = self.word_count
                self.index2word[self.word_count] = word
                self.word_count += 1

In [8]:
import torch

# use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def to_tensor(vocab, sentc):
    # convert a sentence to a column tensor of word indices, terminated by <EOS>
    split_sentence = sentc.split(" ")
    indices = [vocab.word2index[word] for word in split_sentence]
    indices.append(vocab.word2index["<EOS>"])
    return torch.tensor(indices, dtype=torch.long, device=device).view(-1, 1)


In [9]:
import random


def train_test_split(SRC, TRG, test_size=0.2, random_state=None):
    # concatenate each source and target sentence into a single string
    data_src_list = SRC.apply(lambda x: " ".join(x)).to_list()
    data_trg_list = TRG.apply(lambda x: " ".join(x)).to_list()

    # merge source and target sentences into pairs
    pairs = [list(i) for i in zip(data_src_list, data_trg_list)]

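    # optionally seed the random number generator for reproducibility
    if random_state is not None:
        random.seed(random_state)
    # shuffling is disabled, so the split keeps the original dataset order
    # random.shuffle(pairs)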

    # split the pairs into train and test sets
    num_pairs = len(pairs)
    split_idx = int(num_pairs * (1 - test_size))
    train_pairs, test_pairs = pairs[:split_idx], pairs[split_idx:]

    # split the train and test pairs into separate source and target lists
    train_data_src, train_data_trg = zip(*train_pairs)
    test_data_src, test_data_trg = zip(*test_pairs)

    # build vocabularies for source and target sentences
    src_vocab = build_vocab()
    trg_vocab = build_vocab()

    for pair in pairs:
        src_vocab.add_words(pair[0])
        trg_vocab.add_words(pair[1])

    # convert the training sentences to tensors; the test sentences stay as strings
    SRC_train_dataset = [to_tensor(src_vocab, s) for s in train_data_src]
    TRG_train_dataset = [to_tensor(trg_vocab, t) for t in train_data_trg]
    SRC_test_dataset = test_data_src
    TRG_test_dataset = test_data_trg

    return (
        SRC_train_dataset,
        SRC_test_dataset,
        TRG_train_dataset,
        TRG_test_dataset,
        pairs,
    )


In [10]:
(
    SRC_train_dataset,
    SRC_test_dataset,
    TRG_train_dataset,
    TRG_test_dataset,
    pairs,
) = train_test_split(data_df["Question"], data_df["Answer"], test_size=0.2)

len(SRC_train_dataset), len(SRC_test_dataset), len(TRG_train_dataset), len(
    TRG_test_dataset
)

(4800, 1200, 4800, 1200)


# Get the maximum target length and the source and target vocabularies

In [11]:
# calculate the max target length (in tokens)
max_trg = max(len(p[1].split()) for p in pairs)

# build vocabularies for source and target sentences
src_vocab = build_vocab()
trg_vocab = build_vocab()

for pair in pairs:
    src_vocab.add_words(pair[0])
    trg_vocab.add_words(pair[1])

max_trg

43


# Define the network
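
The `Seq2Seq.forward` method in the cell below accepts a `teacher_forcing_prob` argument. Teacher forcing means that, during training, the decoder is sometimes fed the ground-truth previous token instead of its own prediction. A minimal sketch of that choice, using a hypothetical helper `next_decoder_input` that is not part of the notebook:

```python
import random


def next_decoder_input(target_token, predicted_token, teacher_forcing_prob, rng=random):
    """Pick the token that is fed to the decoder at the next time step."""
    # rng.random() is uniform in [0.0, 1.0), so a probability of 1.0 always
    # selects the ground-truth token and 0.0 always selects the prediction
    if rng.random() < teacher_forcing_prob:
        return target_token  # teacher forcing: use the ground truth
    return predicted_token  # free running: use the model's own output


print(next_decoder_input("bush", "hospit", teacher_forcing_prob=1.0))
print(next_decoder_input("bush", "hospit", teacher_forcing_prob=0.0))
```

Values between 0 and 1 mix the two regimes, which is often used to reduce the gap between training (ground-truth inputs) and inference (the model's own outputs).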

In [12]:
import torch.nn as nn


# define the Encoder class, which inherits from the nn.Module class
class Encoder(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(Encoder, self).__init__()

        self.input_size = input_size
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(self.input_size, self.hidden_size)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)

    # define the forward method, which takes in the input sequence, hidden state, and cell state
    def forward(self, input_seq, hidden, cell_state):
        embedded = self.embedding(input_seq)
        embedded = embedded.view(1, 1, -1)
        embedded, (hidden, cell_state) = self.lstm(embedded, (hidden, cell_state))
        return embedded, hidden, cell_state


# define the Decoder class, which also inherits from the nn.Module class
class Decoder(nn.Module):
    def __init__(self, hidden_dim, output_dim):
        super(Decoder, self).__init__()

        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

        self.embedding_layer = nn.Embedding(output_dim, hidden_dim)
        self.lstm_layer = nn.LSTM(hidden_dim, hidden_dim)
        self.fc_layer = nn.Linear(hidden_dim, output_dim)
        self.softmax_layer = nn.LogSoftmax(dim=1)

    # define the forward method, which takes in the input sequence, hidden state, and cell state
    def forward(self, input_seq, hidden_state, cell_state):
        embedded_input = self.embedding_layer(input_seq)
        embedded_input = embedded_input.view(1, 1, -1)
        output, (hidden_state, cell_state) = self.lstm_layer(
            embedded_input, (hidden_state, cell_state)
        )
        output = self.softmax_layer(self.fc_layer(output[0]))
        return output, hidden_state, cell_state


# define the Seq2Seq class, which also inherits from the nn.Module class
class Seq2Seq(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Seq2Seq, self).__init__()

        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

        self.encoder = Encoder(self.input_dim, self.hidden_dim)
        self.decoder = Decoder(self.hidden_dim, self.output_dim)

    # define the forward method, which takes in the source sequence, source length,
    # target length, teacher forcing probability, and an optional target sequence
    def forward(self, src_seq, src_len, trg_len, teacher_forcing_prob=1, trg_seq=None):
        outputs = {"decoder_output": []}

        # initial hidden and cell states; the first 1 = number of LSTM layers
        encoder_hidden = torch.zeros([1, 1, self.hidden_dim]).to(device)
        encoder_cell = torch.zeros([1, 1, self.hidden_dim]).to(device)

        # feed the source sequence to the encoder one token at a time
        for t in range(src_len):
            encoder_output, encoder_hidden, encoder_cell = self.encoder(
                src_seq[t], encoder_hidden, encoder_cell
            )

        # the decoder starts from token index 0 and the encoder's final states
        decoder_input = torch.tensor([[0]], device=device)
        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell

        for t in range(trg_len):
            decoder_output, decoder_hidden, decoder_cell = self.decoder(
                decoder_input, decoder_hidden, decoder_cell
            )
            outputs["decoder_output"].append(decoder_output)

            # teacher forcing: during training, feed the ground-truth token with
            # probability teacher_forcing_prob (only possible when trg_seq is given);
            # otherwise feed the model's own most likely token
            use_teacher_forcing = (
                self.training
                and trg_seq is not None
                and random.random() < teacher_forcing_prob
            )
            if use_teacher_forcing:
                decoder_input = trg_seq[t]
            else:
                decoder_input = decoder_output.argmax(1).detach()

        return outputs


# Train the model
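
The training loop below sums `NLLLoss` over the target tokens of each sample and divides by the target length, so the reported numbers are average negative log-likelihoods per token. The arithmetic can be sketched in plain Python; `sequence_nll` and the toy probabilities are illustrative assumptions, not part of the notebook.

```python
import math


def sequence_nll(step_log_probs, target_ids):
    """Average negative log-likelihood of the target tokens, one log-prob row per step."""
    total = 0.0
    for log_probs, target in zip(step_log_probs, target_ids):
        total += -log_probs[target]  # penalty for the correct token at this step
    return total / len(target_ids)


# two decoding steps over a toy three-word vocabulary
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
step_log_probs = [[math.log(p) for p in probs] for probs in step_probs]

loss = sequence_nll(step_log_probs, [0, 2])
print(round(loss, 4))
```

As a reference point, a model that guesses uniformly over a vocabulary of V words scores ln(V) per token; for the 5,201-word target vocabulary shown later in this notebook that is roughly 8.56.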

In [13]:
lr = 0.01
hidden_dim = 128
epochs = 70
batch_size = 128


In [14]:
seq2seq = Seq2Seq(src_vocab.word_count, hidden_dim, trg_vocab.word_count)


In [15]:
from sklearn.model_selection import KFold


def train_model(
    source_data, target_data, model, num_epochs, batch_size, output_every, lr
):
    # move the model to the selected device
    model.to(device)

    # initialize total training and validation losses to 0
    final_training_loss = 0
    final_validation_loss = 0

    # initialize the loss
    loss = 0

    # define the optimizer and criterion
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.NLLLoss()

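    # use K-Fold splitting: each epoch holds out a different fold for validation.
    # Note that the model is not reset between folds, so samples held out in one
    # epoch are trained on in the others; the validation loss is therefore not a
    # true held-out estimate.
    kf = KFold(n_splits=num_epochs, shuffle=True)

    # iterate through each epoch
    for epoch, (train_idx, val_idx) in enumerate(kf.split(source_data), 1):
        # Set the model in training mode
        model.train()

        # iterate through each sample in the training fold
        for i in range(0, len(train_idx)):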
            src = source_data[train_idx[i]]
            trg = target_data[train_idx[i]]

            # forward pass through the model
            output = model(src, src.size(0), trg.size(0))

            # calculate the loss
            current_loss = 0
            for s, t in zip(output["decoder_output"], trg):
                current_loss += criterion(s, t)

            loss += current_loss
            final_training_loss += current_loss.item() / trg.size(0)

            # update the model parameters every batch_size iterations
            if i % batch_size == 0 or i == (len(train_idx) - 1):
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
                loss = 0

        # set the model in evaluation mode
        model.eval()

        # iterate through each sample in the validation fold;
        # gradients are not needed here, so disable them to save memory
        with torch.no_grad():
            for i in range(0, len(val_idx)):
                src = source_data[val_idx[i]]
                trg = target_data[val_idx[i]]

                # forward pass through the model
                output = model(src, src.size(0), trg.size(0))

                # calculate the loss
                current_loss = 0
                for s, t in zip(output["decoder_output"], trg):
                    current_loss += criterion(s, t)

                final_validation_loss += current_loss.item() / trg.size(0)

        # print the training and validation losses every output_every epochs
        if epoch % output_every == 0:
            training_loss_average = final_training_loss / (
                len(train_idx) * output_every
            )
            validation_loss_average = final_validation_loss / (
                len(val_idx) * output_every
            )
            print(
                "{}/{} Epoch  -  Training Loss = {:.4f}  -  Validation Loss = {:.4f}".format(
                    epoch, num_epochs, training_loss_average, validation_loss_average
                )
            )
            final_training_loss = 0
            final_validation_loss = 0


In [16]:
train_model(
    source_data=SRC_train_dataset,
    target_data=TRG_train_dataset,
    model=seq2seq,
    output_every=5,
    num_epochs=epochs,
    lr=lr,
    batch_size=batch_size,
)

5/70 Epoch  -  Training Loss = 5.9434  -  Validation Loss = 5.7051
10/70 Epoch  -  Training Loss = 5.3602  -  Validation Loss = 5.2128
15/70 Epoch  -  Training Loss = 5.1354  -  Validation Loss = 5.1155
20/70 Epoch  -  Training Loss = 4.7919  -  Validation Loss = 4.8140
25/70 Epoch  -  Training Loss = 4.3363  -  Validation Loss = 4.3695
30/70 Epoch  -  Training Loss = 3.7777  -  Validation Loss = 3.9699
35/70 Epoch  -  Training Loss = 3.1183  -  Validation Loss = 3.6024
40/70 Epoch  -  Training Loss = 2.4468  -  Validation Loss = 2.7335
45/70 Epoch  -  Training Loss = 1.7866  -  Validation Loss = 2.2015
50/70 Epoch  -  Training Loss = 1.1642  -  Validation Loss = 1.4879
55/70 Epoch  -  Training Loss = 0.6709  -  Validation Loss = 0.8617
60/70 Epoch  -  Training Loss = 0.3799  -  Validation Loss = 0.5668
65/70 Epoch  -  Training Loss = 0.2134  -  Validation Loss = 0.3143
70/70 Epoch  -  Training Loss = 0.1270  -  Validation Loss = 0.1540



# Save the trained model

In [17]:
import torch

# define the file path for saving the model
model_path = "seq2seq.pt"

# save the trained Seq2Seq model
torch.save(seq2seq, model_path)

# load the saved Seq2Seq model onto the selected device and set it to evaluation mode
seq2seq = torch.load(model_path, map_location=device)
seq2seq.eval()

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(5082, 128)
    (lstm): LSTM(128, 128)
  )
  (decoder): Decoder(
    (embedding_layer): Embedding(5201, 128)
    (lstm_layer): LSTM(128, 128)
    (fc_layer): Linear(in_features=128, out_features=5201, bias=True)
    (softmax_layer): LogSoftmax(dim=1)
  )
)


# Evaluate the trained model

In [25]:
def model_eval(src, src_vocab, trg_vocab, model, trg_max_len):
    try:
        src = to_tensor(src_vocab, " ".join(prepare_text(src)))
    except KeyError:
        # the input contains a word that is not in the source vocabulary
        print("Can you please explain to me what do you mean?")
        return

    answer = []

    # inference only, so gradients are not needed
    with torch.no_grad():
        output = model(src, src.size(0), trg_max_len)

    for tensor in output["decoder_output"]:
        _, top_token = tensor.data.topk(1)
        # index 1 is the end-of-sequence token
        if top_token.item() == 1:
            break
        else:
            word = trg_vocab.index2word[top_token.item()]
            answer.append(word)

    print("<", " ".join(answer), "\n")


## Evaluate on the test dataset
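
The cell below prints each prediction next to the expected answer but does not score them. A minimal sketch of two scoring rules commonly used for question answering, exact match and token-level F1, is shown here; the function names are illustrative and not part of the notebook, and the example strings are taken from the first test pair printed below.

```python
from collections import Counter


def exact_match(predicted, expected):
    # True only when both answers contain the same tokens in the same order
    return predicted.split() == expected.split()


def token_f1(predicted, expected):
    # harmonic mean of token precision and recall, counting repeated tokens
    pred_tokens, exp_tokens = predicted.split(), expected.split()
    overlap = sum((Counter(pred_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("georg w", "georg w bush"))
print(round(token_f1("georg w", "georg w bush"), 2))
```

Averaging these two values over the test pairs would give a single number per model, which makes it easier to compare runs with and without pretrained embeddings.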

In [26]:
for question, exp_answer in zip(SRC_test_dataset[:100], TRG_test_dataset[:100]):
    print(f"Input --> {question}")
    print(f"Expected --> {exp_answer}")
    print("Predicted: ")
    model_eval(question, src_vocab, trg_vocab, seq2seq, max_trg)
    print("\n")

Input --> in 2007 which presid award lee the presidenti medal of freedom
Expected --> georg w bush
Predicted: 
< georg w 



Input --> a movi adapt of the book was releas in what year
Expected --> 1962
Predicted: 
Can you please explain to me what do you mean?


Input --> who play atticus finch in the 1962 movi of the same titl
Expected --> gregori peck
Predicted: 
< tom 



Input --> which actor receiv an oscar for his role of atticus finch in the 1962 movi of the book
Expected --> gregori peck
Predicted: 
< mayella and 



Input --> what item did lee give the actor gregori peck after portray atticus finch
Expected --> father pocketwatch
Predicted: 
< lenox hill hospit 



Input --> which one of gregori peck relat was name after harper lee
Expected --> grandson
Predicted: 
< beyoncé 



Input --> what person effect did lee give to peck
Expected --> her father pocketwatch
Predicted: 
< woyciechowski 



Input --> which one of peck relat was name harper in honor of lee
Expected --> gran

< 1971 



Input --> in 2002 the sun provid more energi in one hour than human use in what span of time
Expected --> one year
Predicted: 
< horseback combat 



Input --> how much energi in exajoul doe photosynthesi captur each year
Expected --> 3000
Predicted: 
< 220 million 



Input --> twice the amount of energi obtain by all the nonrenew sourc on earth can be provid by the sun in what span of time
Expected --> one year
Predicted: 
< 14 



Input --> what is the amount of solar energi absorb by the earth
Expected --> approxim 3850000 exajoul ej per year
Predicted: 
< 2005 



Input --> how much solar energi is captur by photosynthesi
Expected --> approxim 3000 ej per year
Predicted: 
< 21 



Input --> the amount of solar energi per year is twice as much as the energi that will ever be produc from what resourc
Expected --> coal oil natur gas and mine uranium combin
Predicted: 
< 20 



Input --> where do the major of renew energi deriv their energi from
Expected --> the sun
Predict


## Evaluate on manual user input

In [31]:
random.randint(1, 6000)

5458


In [34]:
data_df_raw.head(5)

|  | Question | Answer |
|---|---|---|
| 0 | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous |
| 1 | What is in front of the Notre Dame Main Building? | a copper statue of Christ |
| 2 | The Basilica of the Sacred heart at Notre Dame... | the Main Building |
| 3 | What is the Grotto at Notre Dame? | a Marian place of prayer and reflection |
| 4 | What sits on top of the Main Building at Notre... | a golden statue of the Virgin Mary |



In [37]:
# data_df_raw = loadDF("data").iloc[:6000, :]

# print 3 random question-answer pairs to test
for idx, row in data_df_raw.iloc[random.sample(range(1, 6000), 3), :].iterrows():
    question = "".join(row["Question"])
    answer = "".join(row["Answer"])
    print(f"> {question}\n< {answer}\n")

> How was she dressed on the cover of L'Officiel?
< in blackface and tribal makeup

> Where did Frédéric and Sand venture to after Majorca became unlivable when it was discovered they were not married?
< Valldemossa

> How many square miles are land in NYC?
< 304.8




In [38]:
print(f"Write 'stop' to stop the chat interaction\n{'-'* 35}\n")
while True:
    src_input = input("> ")
    if src_input.strip() == "stop":
        break
    model_eval(src_input, src_vocab, trg_vocab, seq2seq, max_trg)

Write 'stop' to stop the chat interaction
-----------------------------------

> How was she dressed on the cover of L'Officiel?
< in blackfac and tribal makeup 

> How many square miles are land in NYC?
< 3048 

> Where did Frédéric and Sand venture to after Majorca became unlivable when it was discovered they were not married?
< valldemossa 

> stop

