# LSTM Bot

## Project Overview

In this project, you will build a chatbot to converse with you on a variety of different questions. The chatbot will use a Sequence to Sequence text generation model with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. Compare the performance of your model with pre-trained embeddings versus without.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Multi30K and Squad datasets first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
pip install torchdata==0.3.0 torchvision==0.12.0 torchtext==0.12.0 torch

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata==0.3.0
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
[K     |████████████████████████████████| 47 kB 2.6 MB/s eta 0:00:01
[?25hCollecting torchvision==0.12.0
  Downloading torchvision-0.12.0-cp37-cp37m-manylinux1_x86_64.whl (21.0 MB)
[K     |████████████████████████████████| 21.0 MB 9.5 MB/s eta 0:00:01
Installing collected packages: torchdata, torchvision
Successfully installed torchdata-0.3.0 torchvision-0.12.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown

nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
from torchtext import datasets

In [3]:
def loadDF():
    data = {"question": [], "answer": []}
    index = 0
    train_iter, dev_iter = datasets.SQuAD2()
    for context, question, answers, indices in train_iter:
        if answers[0]:
            data["question"].append(question)
            data["answer"].append(answers[0])
        index += 1
    df =  pd.DataFrame.from_dict(data)
    return df
#### note: this function is from a comment on the forum here - https://knowledge.udacity.com/questions/888774

In [4]:
data = loadDF()

In [5]:
import nltk
from nltk.tokenize import RegexpTokenizer

def prepare_text(sentence):
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''
    #tokens = word_tokenize(sentence)
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(sentence)
    tokens = [token.lower() for token in tokens]
    return tokens

In [6]:
data['question_tokens'] = data['question'].apply(prepare_text)
data['answer_tokens'] = data['answer'].apply(prepare_text)

In [7]:
data

Unnamed: 0,question,answer,question_tokens,answer_tokens
0,When did Beyonce start becoming popular?,in the late 1990s,"[when, did, beyonce, start, becoming, popular]","[in, the, late, 1990s]"
1,What areas did Beyonce compete in when she was...,singing and dancing,"[what, areas, did, beyonce, compete, in, when,...","[singing, and, dancing]"
2,When did Beyonce leave Destiny's Child and bec...,2003,"[when, did, beyonce, leave, destiny, s, child,...",[2003]
3,In what city and state did Beyonce grow up?,"Houston, Texas","[in, what, city, and, state, did, beyonce, gro...","[houston, texas]"
4,In which decade did Beyonce become famous?,late 1990s,"[in, which, decade, did, beyonce, become, famous]","[late, 1990s]"
...,...,...,...,...
86816,In what US state did Kathmandu first establish...,Oregon,"[in, what, us, state, did, kathmandu, first, e...",[oregon]
86817,What was Yangon previously known as?,Rangoon,"[what, was, yangon, previously, known, as]",[rangoon]
86818,With what Belorussian city does Kathmandu have...,Minsk,"[with, what, belorussian, city, does, kathmand...",[minsk]
86819,In what year did Kathmandu create its initial ...,1975,"[in, what, year, did, kathmandu, create, its, ...",[1975]


In [8]:
from sklearn.model_selection import train_test_split
def split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    
    SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset = train_test_split(SRC, TRG, test_size=0.2, random_state=42)
    
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


In [9]:
SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset = split(data['question_tokens'], data['answer_tokens'])

In [10]:
SRC_train_dataset

34614                         [who, shot, queen, victoria]
84775    [how, many, people, are, on, the, territorial,...
75487            [where, does, sr, 94, merge, with, i, 15]
24088    [when, did, natural, bronze, start, to, be, us...
36068    [who, was, the, final, king, of, the, attalid,...
                               ...                        
6265                  [what, is, the, biggest, known, dog]
54886             [what, encoding, does, charis, sil, use]
76820            [who, made, the, demonstration, in, 1943]
860      [at, what, age, did, frédéric, start, giving, ...
15795       [where, is, corruption, even, more, prevalent]
Name: question_tokens, Length: 69456, dtype: object

In [11]:
TRG_train_dataset

34614                            [roderick, maclean]
84775                                     [nineteen]
75487                                  [at, miramar]
24088                                     [5500, bc]
36068                                 [attalus, iii]
                            ...                     
6265                              [english, mastiff]
54886    [graphite, opentype, or, aat, technologies]
76820                              [luria, delbrück]
860                                              [7]
15795                     [non, privatized, sectors]
Name: answer_tokens, Length: 69456, dtype: object

In [12]:
class Vocabulary:
    def __init__(self):
        self.word2index = {}
        self.word2count = {}
        self.index2word = {}
        self.num_words = 0
        
        self.add_token('<UNK>')

    def add_token(self, token):
        if token not in self.word2index:
            self.word2index[token] = self.num_words
            self.word2count[token] = 1
            self.index2word[self.num_words] = token
            self.num_words += 1
        else:
            self.word2count[token] += 1

    def add_tokens(self, tokens):
        for token in tokens:
            self.add_token(token)
            
    def discard_rare_words(self, min_count):
        tokens_to_remove = []
        for token in self.word2count:
            if self.word2count[token] < min_count:
                tokens_to_remove.append(token)

        for token in tokens_to_remove:
            del self.word2index[token]
            del self.word2count[token]

        self.index2word = {index: token for token, index in self.word2index.items()}
        self.num_words = len(self.word2index)

    def __len__(self):
        return self.num_words

    def __str__(self):
        return f"Vocabulary size: {self.num_words}"

    def token_to_index(self, token):
        return self.word2index.get(token, self.word2index['<UNK>'])

    def index_to_token(self, index):
        return self.index2word.get(index, '<UNK>')

    def get_token_count(self, token):
        return self.word2count.get(token, 0)

In [22]:
vocabulary = Vocabulary()
vocabulary_src = Vocabulary()
vocabulary_trg = Vocabulary()

In [23]:
for row in SRC_train_dataset:
    vocabulary.add_tokens(row)
    vocabulary_src.add_tokens(row)

In [24]:
for row in TRG_train_dataset:
    vocabulary.add_tokens(row)
    vocabulary_trg.add_tokens(row)

In [16]:
print(vocabulary)
print(vocabulary.token_to_index('how'))
print(vocabulary.index_to_token(3))
print(vocabulary.get_token_count('how'))

Vocabulary size: 47603
5
queen
7670


In [17]:
vocabulary.discard_rare_words(2)
print(vocabulary)

Vocabulary size: 26810


In [27]:
len(vocabulary_src)

33335

In [32]:
import torch.nn as nn
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        
        super(Encoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.input_size = input_size
        
        self.hidden = torch.zeros(1, 1, hidden_size)
        
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(self.input_size, self.hidden_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size, 1) 
        
    
    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        embedded = self.embedding(i)
        o, (h, c) = self.lstm(embedded)
        
        return o, h, c
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(output_size, self.hidden_size)
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)

        # self.ouput, predicts on the hidden state via a linear output layer   
        self.output = nn.Linear(self.hidden_size, self.output_size)
        
    def forward(self, i, h):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        embedded = self.embedding(i)  # Embed the target vector
        embedded = embedded.unsqueeze(0)  # Add a batch dimension

        o, h = self.lstm(embedded, h)  # Pass the embedded input and previous hidden state through the LSTM

        o = o.squeeze(0)  # Remove the batch dimension from the output
        o = self.output(o)
        
        return o, h
        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size):
        
        super(Seq2Seq, self).__init__()
        
        self.input_size = encoder_input_size
        self.hidden_size = encoder_hidden_size
        self.output_size = decoder_output_size
        
        self.encoder = Encoder(self.input_size, self.hidden_size)
        self.decoder = Decoder(self.hidden_size, self.output_size)
        
        assert self.encoder.hidden_size == self.decoder.hidden_size, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert self.encoder.lstm.num_layers == self.decoder.lstm.num_layers, \
            "Encoder and decoder must have equal number of layers!"
    
    
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):      
        encoder_outputs, encoder_hidden, encoder_cell = self.encoder(src)
    
        decoder_hidden = encoder_hidden
        decoder_cell = encoder_cell

        decoder_input = trg[0]
        
        o = torch.zeros(trg.shape[0], self.decoder.output_size)
    
        for t in range(1, trg.shape[0]):
            decoder_output, decoder_hidden = self.decoder(decoder_input, decoder_hidden)
            o[t] = decoder_output

            use_teacher_forcing = random.random() < teacher_forcing_ratio
            
            if use_teacher_forcing:
                decoder_input = trg[t]
            else:
                decoder_input = decoder_output.argmax(dim=1)
        
        
        return o

    



In [33]:
INPUT_DIM = len(vocabulary_src)
OUTPUT_DIM = len(vocabulary_trg)
HID_DIM = 512

In [34]:
enc = Encoder(INPUT_DIM,HID_DIM)
dec = Decoder(HID_DIM, OUTPUT_DIM)

In [39]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Seq2Seq(INPUT_DIM, HID_DIM, HID_DIM, OUTPUT_DIM).to(device)

In [40]:
model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(33335, 512)
    (lstm): LSTM(512, 512)
  )
  (decoder): Decoder(
    (embedding): Embedding(33741, 512)
    (lstm): LSTM(512, 512)
    (output): Linear(in_features=512, out_features=33741, bias=True)
  )
)

In [41]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(33335, 512)
    (lstm): LSTM(512, 512)
  )
  (decoder): Decoder(
    (embedding): Embedding(33741, 512)
    (lstm): LSTM(512, 512)
    (output): Linear(in_features=512, out_features=33741, bias=True)
  )
)

In [42]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 55,854,541 trainable parameters


In [44]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())

In [45]:
criterion = nn.CrossEntropyLoss()

In [58]:
SRC_train_dataset

34614                         [who, shot, queen, victoria]
84775    [how, many, people, are, on, the, territorial,...
75487            [where, does, sr, 94, merge, with, i, 15]
24088    [when, did, natural, bronze, start, to, be, us...
36068    [who, was, the, final, king, of, the, attalid,...
                               ...                        
6265                  [what, is, the, biggest, known, dog]
54886             [what, encoding, does, charis, sil, use]
76820            [who, made, the, demonstration, in, 1943]
860      [at, what, age, did, frédéric, start, giving, ...
15795       [where, is, corruption, even, more, prevalent]
Name: question_tokens, Length: 69456, dtype: object

In [93]:
questions_list_train = SRC_train_dataset.tolist()
questions_list_test = SRC_test_dataset.tolist()
answers_list_train = TRG_train_dataset.tolist()
answers_list_test = TRG_test_dataset.tolist()

train_data = list(zip(questions_list_train, answers_list_train))
test_data = list(zip(questions_list_test, answers_list_test))



import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from torchtext.data.utils import get_tokenizer

class QADataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        question, answer = self.data[index]
        
        # Tokenize question and answer sequences
        tokenized_question = self.tokenizer(question)
        tokenized_answer = self.tokenizer(answer)

        # Convert tokenized sequences to tensors
        question_tensor = torch.tensor([self.tokenizer.vocab.stoi[token] for token in tokenized_question])
        answer_tensor = torch.tensor([self.tokenizer.vocab.stoi[token] for token in tokenized_answer])

        # Apply padding to tokenized sequences
        padded_question = pad_sequence([question_tensor], batch_first=True)
        padded_answer = pad_sequence([answer_tensor], batch_first=True)

        return padded_question.squeeze(0), padded_answer.squeeze(0)

# Define your tokenizer
tokenizer = get_tokenizer('basic_english')

# Create the train and test datasets
train_data = list(zip(questions_list_train, answers_list_train))
test_data = list(zip(questions_list_test, answers_list_test))

train_dataset = QADataset(train_data, tokenizer)
test_dataset = QADataset(test_data, tokenizer)

batch_size = 32

train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size)
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size)

In [94]:
model.train()
num_epochs = 3

for epoch in range(num_epochs):
    for batch_idx, (src, trg) in enumerate(train_dataloader):
        src = src.to(device)
        trg = trg.to(device)

        optimizer.zero_grad()

        output = model(src, trg)

        loss = criterion(output, trg)

        loss.backward()
        optimizer.step()

        if batch_idx % log_interval == 0:
            print(f"Epoch: {epoch+1}, Batch: {batch_idx+1}/{len(train_dataloader)}, Loss: {loss.item()}")

AttributeError: 'list' object has no attribute 'lower'