# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. The starter code below trains Word2Vec embeddings on the Brown corpus with Gensim, then saves and reloads them. You can compare the performance of your model with these embeddings against a model without them.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model feeds an input sequence into the Encoder and passes the Encoder's final hidden state to the Decoder. The Decoder then uses that state to output a series of token predictions, one token per step.
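
The hand-off between the two components can be sketched without any neural network at all. The block below is a toy illustration only: `encode`, `decode_step`, and `generate` are hypothetical stand-ins (plain functions, not LSTMs), and the "state" is just a list. It shows the control flow described above, where the Encoder produces one state and the Decoder consumes it step by step until it emits an end-of-sequence token.

```python
# Toy stand-ins for illustration only: nothing is learned here.
SOS, EOS = "<sos>", "<eos>"

def encode(tokens):
    """Fold the whole input sequence into one state (here: the reversed tokens)."""
    state = []
    for tok in tokens:
        state = [tok] + state
    return state

def decode_step(prev_token, state):
    """Emit one token plus the updated state; emit EOS once the state is used up."""
    if not state:
        return EOS, state
    return state[0], state[1:]

def generate(tokens, max_len=10):
    state = encode(tokens)            # Encoder pass: input -> state
    token, output = SOS, []
    for _ in range(max_len):          # Decoder loop: one prediction per step
        token, state = decode_step(token, state)
        if token == EOS:
            break
        output.append(token)
    return output

print(generate(["how", "are", "you"]))
```

In the real model, `encode` corresponds to the embedding layer plus LSTM of the Encoder, and `decode_step` to one pass through the Decoder's embedding, LSTM, and linear output layer.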

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [2]:
#!pip uninstall torch torchvision torchaudio -y

In [3]:
#!pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

In [5]:
#!pip install dataloader
#from torch.utils.data import Dataset, DataLoader
!nvidia-smi

Mon Apr 10 22:40:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+---------------------------------------------------------------------------

In [6]:
!pip show torch

Name: torch
Version: 1.7.1+cu110
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /root/.local/lib/python3.7/site-packages
Requires: typing-extensions, numpy
Required-by: torchvision, torchaudio, torchtext


In [7]:
import torch
torch.__version__

'1.7.1+cu110'

In [8]:
import pandas as pd
import json
import string

In [9]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from torch.utils.data import TensorDataset, DataLoader

nltk.download('brown')
nltk.download('punkt')

# Train Word2Vec embeddings on the Brown corpus, then save and reload them

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

def clean_text(x):
    if len(x) > 0:
        return x[0]["text"]

def loadDF(path):
    
    df = pd.read_json(path) 
    data = pd.json_normalize(data = df['data'],
                            record_path =['paragraphs', 'qas'])
    data = data.drop('id', axis=1)
    data = data.drop('is_impossible', axis=1)
    data = data.drop('plausible_answers', axis=1)
    data['answers'] = data['answers'].apply(clean_text)
    
    return data

def tokenize_en(text):
    # Strip punctuation, tokenize with NLTK, and lowercase; returns None for missing text.
    if text is not None:
        text = text.translate(str.maketrans('', '', string.punctuation))
        return [tok.lower().strip() for tok in word_tokenize(text)]
    return None

#, dev_test
def prepare_text(df_train):

    # Keep only rows that have a non-empty answer. The original row-wise loop
    # assigned to a copy of each row, so it never changed the DataFrame.
    has_answer = df_train['answers'].notna() & (df_train['answers'] != '')
    # reset_index returns a new DataFrame, so its result must be assigned.
    # Working on this new frame also avoids pandas' SettingWithCopyWarning.
    df_train = df_train[has_answer].reset_index(drop=True)

    df_train['question_token'] = df_train['question'].apply(tokenize_en)
    df_train['answer_token'] = df_train['answers'].apply(tokenize_en)

    #dev_test['question_token'] = dev_test['question'].apply(tokenize_en)
    #dev_test['answer_token'] = dev_test['answers'].apply(tokenize_en)

    return df_train #, dev_test



def train_test_split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    
    # TODO: build the four datasets here; the names below are undefined until you do.
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
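
The `train_test_split` stub above is left for you to fill in. One possible shape for it is sketched below, assuming the questions and answers arrive as two parallel Python lists. The function name `split_pairs`, the `test_fraction` of 0.2, and the fixed `seed` are illustrative choices, not part of the starter code.

```python
import random

def split_pairs(src, trg, test_fraction=0.2, seed=0):
    """Shuffle question/answer pairs together, then split into train and test lists."""
    if len(src) != len(trg):
        raise ValueError("src and trg must have the same length")
    order = list(range(len(src)))
    random.Random(seed).shuffle(order)      # one shared order keeps pairs aligned
    n_test = int(len(order) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    src_train = [src[i] for i in train_idx]
    src_test = [src[i] for i in test_idx]
    trg_train = [trg[i] for i in train_idx]
    trg_test = [trg[i] for i in test_idx]
    return src_train, src_test, trg_train, trg_test

questions = [f"q{i}" for i in range(10)]
answers = [f"a{i}" for i in range(10)]
src_train, src_test, trg_train, trg_test = split_pairs(questions, answers)
```

Shuffling a single list of positions, rather than shuffling the two lists separately, is what keeps each question matched with its own answer.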


In [10]:
df_train = loadDF('./train-v2.0.json')

In [11]:
dev_test = loadDF('./dev-v2.0.json')

In [12]:
dev_test.head()

| | question | answers |
|---|---|---|
| 0 | In what country is Normandy located? | France |
| 1 | When were the Normans in Normandy? | 10th and 11th centuries |
| 2 | From which countries did the Norse originate? | Denmark, Iceland and Norway |
| 3 | Who was the Norse leader? | Rollo |
| 4 | What century did the Normans first gain their ... | 10th century |


In [13]:
#pd.set_option('display.max_rows', None)

In [14]:
df_train.head()

| | question | answers |
|---|---|---|
| 0 | When did Beyonce start becoming popular? | in the late 1990s |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing |
| 2 | When did Beyonce leave Destiny's Child and bec... | 2003 |
| 3 | In what city and state did Beyonce grow up? | Houston, Texas |
| 4 | In which decade did Beyonce become famous? | late 1990s |


In [15]:
df_train = prepare_text(df_train)  #,  dev_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [16]:
df_train.head()

| | question | answers | question_token | answer_token |
|---|---|---|---|---|
| 0 | When did Beyonce start becoming popular? | in the late 1990s | [when, did, beyonce, start, becoming, popular] | [in, the, late, 1990s] |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing | [what, areas, did, beyonce, compete, in, when,... | [singing, and, dancing] |
| 2 | When did Beyonce leave Destiny's Child and bec... | 2003 | [when, did, beyonce, leave, destinys, child, a... | [2003] |
| 3 | In what city and state did Beyonce grow up? | Houston, Texas | [in, what, city, and, state, did, beyonce, gro... | [houston, texas] |
| 4 | In which decade did Beyonce become famous? | late 1990s | [in, which, decade, did, beyonce, become, famous] | [late, 1990s] |


In [17]:
word2VectorDict = w2v.wv.key_to_index 

In [18]:
SOS_token = 55000
EOS_token = 55001
PAD_IDX = 55002

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {"SOS": SOS_token, "EOS": EOS_token, "PAD_IDX": PAD_IDX}
        self.word2count = {}
        self.index2word = {SOS_token: "SOS", EOS_token: "EOS", PAD_IDX: "PAD_IDX"}
        self.n_words = 3  # counts SOS, EOS and PAD_IDX

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)

    def addWord(self, word):
        if word not in self.word2index:
            if word in word2VectorDict:
                # Reuse the word's row index from the Word2Vec vocabulary.
                index = word2VectorDict[word]
            else:
                # Out-of-vocabulary word: fall back to the running word count.
                # NOTE: this value can coincide with a Word2Vec index, so two
                # different words may end up sharing one index.
                index = self.n_words
            self.word2index[word] = index
            self.index2word[index] = word
            self.n_words += 1
            self.word2count[word] = 1
        else:
            self.word2count[word] += 1
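
The `Lang` class mixes three index sources: Word2Vec row numbers, a running word counter, and the fixed values 55000 to 55002 for the special tokens. Because the counter and the Word2Vec rows draw from the same range, two words can end up with the same index, and the largest index is far larger than the number of words. A common alternative is to hand out contiguous ids in order of first appearance. The sketch below assumes that design; `Vocab` and its special-token names are hypothetical and not part of the starter code.

```python
class Vocab:
    """Maps tokens to contiguous ids 0..len-1, so every id is a valid embedding row."""

    def __init__(self, specials=("<pad>", "<sos>", "<eos>")):
        self.token_to_id = {}
        self.id_to_token = []
        for tok in specials:
            self.add(tok)

    def add(self, token):
        """Register a token if it is new and return its id either way."""
        if token not in self.token_to_id:
            self.token_to_id[token] = len(self.id_to_token)
            self.id_to_token.append(token)
        return self.token_to_id[token]

    def __len__(self):
        return len(self.id_to_token)

vocab = Vocab()
for word in ["when", "did", "beyonce", "start", "did"]:
    vocab.add(word)
```

With this layout `len(vocab)` can be passed directly as the first argument of an embedding layer. If you want to reuse the Word2Vec vectors, they would then need to be copied row by row into the embedding matrix in this new order.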

In [19]:
lang = Lang(name='alltokens')
Questionlang = Lang(name='allQuestiontokens')
Answerlang = Lang(name='allAnswertokens')

In [20]:
#df_train.head(5000)

In [21]:
#print(df_train.size)
df_train = df_train.head(150)
#print(df_train.size)

In [22]:
len(df_train.index)

150

In [23]:
for i,r, in df_train.iterrows():
    text = r['question_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)
            Questionlang.addWord(t)

for i,r, in df_train.iterrows():
    text = r['answer_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)
            Answerlang.addWord(t)

In [24]:
#lang.word2index

In [25]:
#max(lang.word2count, key=lang.word2count.get)

In [26]:
'''for i,r, in dev_test.iterrows():
    text = r['question_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)

for i,r, in dev_test.iterrows():
    text = r['answer_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)*/'''


In [27]:
import torch
device = torch.device(
  'cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [28]:
max_question_word_count = 10
max_answer_word_count = 5

In [29]:
def indexesFromSentence(lang, sentencelist):
    # Map a list of tokens to vocabulary indices, skipping empty tokens.
    if sentencelist is not None and sentencelist != '':
        sentencelist = [tok for tok in sentencelist if tok]
        if len(sentencelist) > 0:
            return [lang.word2index[word] for word in sentencelist]
    return []

def tensorFromSentence(lang, sentence, max_len):
    # Truncate or pad to max_len tokens, then wrap with SOS and EOS, so every
    # tensor has length max_len + 2. Note that EOS comes after any padding.
    indexes = indexesFromSentence(lang, sentence)
    if len(indexes) > max_len:
        indexes = indexes[:max_len]
    if len(indexes) < max_len:
        indexes.extend([PAD_IDX] * (max_len - len(indexes)))
    indexes.append(EOS_token)
    indexes.insert(0, SOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device)
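
`tensorFromSentence` pads first and appends the end-of-sequence token afterwards, so in a short sentence the padding sits between the last real word and EOS. A frequently used alternative is to close the sentence with EOS right after the last word and pad behind it. The sketch below shows that ordering on plain lists; `pad_sequence_ids` and the small id values in the example are hypothetical.

```python
def pad_sequence_ids(ids, max_len, sos_id, eos_id, pad_id):
    """Return [SOS] + truncated ids + [EOS] + padding; the result is always max_len + 2 long."""
    body = list(ids)[:max_len]
    padding = [pad_id] * (max_len - len(body))
    return [sos_id] + body + [eos_id] + padding

# Example with small made-up ids: pad=0, sos=1, eos=2.
short = pad_sequence_ids([7, 8], max_len=4, sos_id=1, eos_id=2, pad_id=0)
long = pad_sequence_ids([7, 8, 9, 10, 11, 12], max_len=4, sos_id=1, eos_id=2, pad_id=0)
```

Either ordering gives fixed-length tensors that can be batched. Placing EOS before the padding lets the decoder learn to stop directly after the last word.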

In [30]:
data = []
for i, r in df_train.iterrows():
    tensor_question_index = tensorFromSentence(lang, r['question_token'], max_question_word_count)
    answer_answer_index = tensorFromSentence(lang, r['answer_token'], max_answer_word_count)
    data.append((tensor_question_index, answer_answer_index))



In [31]:
#source_data = [tensorFromSentence(lang, sentencelist) for sentencelist in df_train['question_token']]

In [32]:
#target_data = [tensorFromSentence(lang, sentencelist) for sentencelist in df_train['answer_token']]  

In [33]:
#dev_source_data = [tensorFromSentence(lang, sentencelist) for sentencelist in dev_test['question_token']]

In [34]:
#dev_target_data = [tensorFromSentence(lang, sentencelist) for sentencelist in dev_test['answer_token']]  

In [35]:
#source_list = source_data[0:150]
#target_list = target_data[0:150]

In [36]:
PAD_IDX = lang.word2index['PAD_IDX']
SOS_IDX = lang.word2index['SOS']
EOS_IDX = lang.word2index['EOS']
print(PAD_IDX)
print(SOS_IDX)
print(EOS_IDX)

55002
55000
55001


In [37]:
data[0]

(tensor([55000,    71,   105,     5,   728,  2011,  1141, 55002, 55002, 55002,
         55002, 55001], device='cuda:0'),
 tensor([55000,     7,     0,   564,   404, 55002, 55001], device='cuda:0'))

In [38]:
train_iter = DataLoader(data, batch_size=10,
                        shuffle=False)

In [39]:
for i, (x, y) in enumerate(train_iter):
    print(x.squeeze(0))

tensor([[55000,    71,   105,     5,   728,  2011,  1141, 55002, 55002, 55002,
         55002, 55001],
        [55000,    79,   420,   105,     5,  4477,     7,    71,    58,    10,
          1025, 55001],
        [55000,    71,   105,     5,   485,    18,   461,     4,   270,     6,
         10641, 55001],
        [55000,     7,    79,   375,     4,   182,   105,     5,  1774,    61,
         55002, 55001],
        [55000,     7,    35,  2422,   105,     5,   270,  1232, 55002, 55002,
         55002, 55001],
        [55000,     7,    79,    31,   248,    10,    58,     0,   825,  9817,
         55002, 55001],
        [55000,    79, 11822,    97,    40,     6,    38,   400,  2053, 55002,
         55002, 55001],
        [55000,    53,  3109,     0,    18,   461,   248, 55002, 55002, 55002,
         55002, 55001],
        [55000,    71,   105,    43,  1077,     5,  5979, 55002, 55002, 55002,
         55002, 55001],
        [55000,    79,  1044,   105,    43,    33,     7,    18,   461, 5

In [40]:
import random  # used by Seq2Seq.forward for teacher forcing

import torch.nn as nn
import torch.optim as optim

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [41]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

In [42]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
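
The teacher-forcing decision inside `forward` is a single coin flip per decoding step. Isolated from the model, it can be sketched as below. `choose_next_input` is a hypothetical helper, and the strings stand in for token tensors.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Return the ground-truth token with probability teacher_forcing_ratio, else the prediction."""
    use_teacher = rng.random() < teacher_forcing_ratio
    return target_token if use_teacher else predicted_token

rng = random.Random(0)  # seeded so the run is repeatable
picks = [choose_next_input("truth", "guess", 0.5, rng) for _ in range(1000)]
share_truth = picks.count("truth") / len(picks)
print(f"ground truth used in {share_truth:.0%} of steps")
```

A ratio of 1.0 always feeds the ground truth and a ratio of 0.0 always feeds the model's own prediction, which is why evaluation code typically passes 0.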

In [43]:
INPUT_DIM =  len(Questionlang.word2index) #len(SRC.vocab)
print(INPUT_DIM)
print(Questionlang.n_words)
OUTPUT_DIM = len(Answerlang.word2index) #len(TRG.vocab)
print(OUTPUT_DIM)
print(Answerlang.n_words)

403
403
181
181
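
The numbers printed above count distinct words, but an embedding layer with `num_embeddings` rows only accepts indices from 0 to `num_embeddings - 1`. Since the tensors built earlier contain values such as 55000, a range check before constructing the model may save a hard-to-read CUDA error later. Whether this explains the error at the end of this notebook is an assumption, because the traceback is not shown. `find_out_of_range` is a hypothetical helper working on plain lists.

```python
def find_out_of_range(sequences, num_embeddings):
    """Return the sorted ids that are not valid rows of an embedding with num_embeddings rows."""
    bad = {i for seq in sequences for i in seq if not 0 <= i < num_embeddings}
    return sorted(bad)

# Ids copied from the first question tensor printed earlier in this notebook.
first_question = [55000, 71, 105, 5, 728, 2011, 1141, 55002, 55002, 55002, 55002, 55001]
out_of_range = find_out_of_range([first_question], num_embeddings=403)
print(out_of_range)
```

An empty result means every index fits. Any id listed would need either a larger embedding table or a remapped vocabulary.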


In [44]:
INPUT_DIM =  len(Questionlang.word2index) #len(SRC.vocab)
OUTPUT_DIM = len(Answerlang.word2index) #len(TRG.vocab)

ENC_EMB_DIM = INPUT_DIM
DEC_EMB_DIM = OUTPUT_DIM
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

modelSeq2Seq = Seq2Seq(enc, dec, device).to(device)

In [45]:
#def init_weights(m):
#    for name, param in m.named_parameters():
#        nn.init.uniform_(param.data, -0.08, 0.08)
        
#modelSeq2Seq.apply(init_weights)

In [46]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(modelSeq2Seq):,} trainable parameters')

The model has 7,791,895 trainable parameters


In [47]:
#TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
 #ignore_index = TRG_PAD_IDX

In [48]:
optimizer = torch.optim.SGD(modelSeq2Seq.parameters(), lr=0.01)
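# The decoder returns raw scores (no log-softmax), so use CrossEntropyLoss,
# which applies log-softmax internally. NLLLoss would expect log-probabilities.
criterion = nn.CrossEntropyLoss()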
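
A loss over token predictions needs log-probabilities rather than raw scores. `nn.NLLLoss` assumes its input has already been through a log-softmax, while `nn.CrossEntropyLoss` applies the log-softmax itself. The relationship can be checked by hand on a three-word vocabulary. The sketch below uses only the standard library, and the scores are made-up numbers.

```python
import math

def log_softmax(scores):
    """Convert raw scores into log-probabilities (shifted by the max for stability)."""
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]

def nll(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

scores = [2.0, 0.5, -1.0]                      # raw decoder scores, made up
loss = nll(log_softmax(scores), target=0)      # what cross-entropy computes
wrong = nll(scores, target=0)                  # raw scores fed straight in
print(f"cross-entropy: {loss:.4f}, NLL on raw scores: {wrong:.4f}")
```

Applying `nll` to the raw scores gives -2.0 here, a negative "loss" that no longer measures how likely the target is. That is the symptom to watch for when a linear output layer is paired with `nn.NLLLoss` directly.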

In [49]:
def train(model, train_iter, optimizer, criterion, clip):
    
    #model.to(device)
    epoch_loss = 0
    
    model.train()
        
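    for src, trg in train_iter:
        # DataLoader yields [batch size, seq len]; the model expects [seq len, batch size].
        src = src.t()
        trg = trg.t()

        optimizer.zero_grad()
        output = model(src, trg)
        output_dim = output.shape[-1]
        # Skip position 0 (the SOS slot) and flatten for the loss.
        # reshape is used because the transposed trg is not contiguous.
        output = output[1:].reshape(-1, output_dim)
        trg = trg[1:].reshape(-1)

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(train_iter)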

In [50]:
def evaluate(model, train_iter, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for src, trg in train_iter:
            # DataLoader yields [batch size, seq len]; the model expects [seq len, batch size].
            src = src.t()
            trg = trg.t()

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]

            # reshape is used because the transposed trg is not contiguous
            output = output[1:].reshape(-1, output_dim)
            trg = trg[1:].reshape(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)

            epoch_loss += loss.item()

    return epoch_loss / len(train_iter)

In [51]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [52]:
import math  # used for the perplexity printed below
import time
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(modelSeq2Seq, train_iter, optimizer, criterion, CLIP)
    # No held-out split exists yet, so this "validation" loss is measured on the training batches.
    valid_loss = evaluate(modelSeq2Seq, train_iter, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(modelSeq2Seq.state_dict(), 'chat-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
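
The training loop reports perplexity as `math.exp(loss)`. As a sanity check on what that number means, the sketch below computes perplexity from per-token log-probabilities using only the standard library. The `perplexity` helper is hypothetical. The value 181 is the answer vocabulary size printed earlier, and the example assumes a model that guesses uniformly over it.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability assigned to the correct tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that spreads its probability evenly over 181 words.
uniform = [math.log(1 / 181)] * 5
print(round(perplexity(uniform), 3))
```

A uniform guesser therefore scores a perplexity equal to the vocabulary size, so a trained model should report a value well below that.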